Fundamentals of WCS' Public Site Search Infrastructure
WebCenter Sites (WCS) ships with Apache Lucene.
Apache Lucene drives:
- Asset-oriented search within the Contributor UI. This is set up automatically on a standard install. Users can enable / disable it on a per asset type basis as needed.
- Asset-oriented search through a WCS-driven website's search box (or any equivalent feature). Developers must implement website-specific search logic; WCS provides a very basic API for this purpose.
In this post, we will focus on the fundamentals of Lucene-driven, asset-oriented search in WCS-driven websites.
It is outside the scope of this article to discuss any details related to:
- Back-office asset search.
- Specific use cases of WCS' (Lucene) Search API.
The Big Picture: how are WCS and Lucene integrated?
Asset-specific data is sent for indexing / re-indexing / de-indexing whenever an asset is created / updated / deleted, by means of an "internal" asset event listener that is plugged into WCS OOTB.
All 3 scenarios work exactly the same:
- An asset create / update / delete event gets triggered and is processed by this asset event listener which, based on the asset type and the Lucene indexing configuration, adds the asset to [0..N] indexing queue(s) as needed. Each asset may:
- Get added to the "Global" indexing queue and/or
- Get added to 1+ "Asset Type-specific" indexing queue(s) and/or
- Not get added to any queue at all.
- The indexing daemon periodically processes the indexing queues; assets get indexed / re-indexed / de-indexed as needed, based on the data that exists at that moment.
- This is important to know because it implies assets get re-indexed even when a publishing session (or at least the publishing group that asset was included in) fails.
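The flow above can be sketched as a toy model in plain Java. This is only an illustration under stated assumptions: the class, fields and method names are hypothetical and are not WCS classes, the maps stand in for the asset repository and the Lucene index, and a single in-memory queue stands in for the persistent indexing queues. What it tries to show is that all three events are queued the same way and that the daemon decides what to do from the data existing at processing time.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the event, queue and daemon flow; none of these names are WCS classes. */
final class IndexingFlowSketch {
    /** Stand-in for the asset repository: asset key ("Type:id") mapped to its current text. */
    final Map<String, String> assets = new HashMap<>();
    /** Stand-in for the search index. */
    final Map<String, String> index = new HashMap<>();
    /** Stand-in for one indexing queue; only asset keys are queued, never asset data. */
    final Queue<String> queue = new ArrayDeque<>();

    /** Called on create / update / delete: all three events are handled identically. */
    void onAssetEvent(String assetKey) {
        queue.add(assetKey);
    }

    /** One daemon pass: the action is derived from the data as it exists right now. */
    void runDaemonPass() {
        String assetKey;
        while ((assetKey = queue.poll()) != null) {
            String current = assets.get(assetKey);
            if (current == null) {
                index.remove(assetKey);       // asset no longer exists: de-index
            } else {
                index.put(assetKey, current); // asset exists: index or re-index
            }
        }
    }

    public static void main(String[] args) {
        IndexingFlowSketch wcs = new IndexingFlowSketch();
        wcs.assets.put("Article:1", "first draft");
        wcs.onAssetEvent("Article:1");           // create
        wcs.assets.put("Article:1", "final text");
        wcs.onAssetEvent("Article:1");           // update, queued before the daemon ran
        wcs.runDaemonPass();
        System.out.println(wcs.index);           // the index holds the latest text only

        wcs.assets.remove("Article:1");
        wcs.onAssetEvent("Article:1");           // delete
        wcs.runDaemonPass();
        System.out.println(wcs.index);           // the entry is gone
    }
}
```

Because only the asset key travels through the queue, the outcome of a pass depends on the state of the asset when the daemon runs, not on which event put it in the queue.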
Setting Up Indexing
Lucene Indexing is set up in WCS on a per-asset-type basis. For any asset type, you can decide:
- If (all) assets of that type get indexed or not, i.e. if indexing is enabled or disabled for that asset type.
- For any asset type being indexed:
- The attributes that get indexed and the ones that don't.
- Whether or not binary attributes (files) should get indexed.
The steps for accomplishing all of the above are explained in the official documentation, which you can find here.
These are just some of the particularities about Lucene indexing in WCS that you should consider when designing the indexing strategy for your WCS implementation:
- You cannot enable indexing for a specific flex definition / subtype. Granularity is set at the asset type level.
- Binary indexing is a global switch: when you enable it, you are doing so for all assets of all types, regardless of whether any of them has indexable blob attributes.
- Whether a specific blob attribute (i.e. the contents of the file(s) referenced by its value(s)) actually gets indexed is determined by making that specific attribute indexable.
- Lucene indexing in WCS occurs asynchronously, even if almost continuously (running in the background).
- Indexing for public site search takes place upon publishing, on the publishing target. Staging (a.k.a. Management) instances are irrelevant in this specific context.
The Indexing Asset Event Listener And Supporting Queues
Assets make it to the appropriate indexing queues by means of an asset event listener which gets triggered every time an asset is created / updated / deleted: com.openmarket.basic.event.SearchAssetIdEventListener
All asset event listeners are registered and enabled / disabled by means of the ASSETLISTENER_REG table.
The OOTB indexing asset event listener's only job is to:
- Determine, for each asset, into which indexing queues it should be inserted, if any, and
- Insert each asset (type + id) into the appropriate indexing queues.
WCS' OOTB indexing queues are implemented as persistent (DB-backed) FIFO queues.
These queues are unique to each environment. More specifically, in clustered environments, all nodes write to the very same, unique queues.
By default, the "Global" indexing queue is backed by the "Global_Q" table, and each asset type-specific indexing queue is backed by its own table, named e.g. "<asset type>_Q".
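As a rough illustration of the routing decision the listener makes, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class name, the two configuration sets and the method are hypothetical, and the real logic lives inside WCS. It also simplifies things by routing each asset to at most one asset type-specific queue. Only the "Global_Q" and "<asset type>_Q" names come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical routing model; the real decision logic is internal to WCS. */
final class QueueRoutingSketch {
    private final Set<String> globallyIndexedTypes; // assumed stand-in for the "Global" indexing config
    private final Set<String> typeIndexedTypes;     // assumed stand-in for the per-type indexing config

    QueueRoutingSketch(Set<String> globallyIndexedTypes, Set<String> typeIndexedTypes) {
        this.globallyIndexedTypes = globallyIndexedTypes;
        this.typeIndexedTypes = typeIndexedTypes;
    }

    /** Returns the queue tables an asset of the given type would be inserted into (possibly none). */
    List<String> queuesFor(String assetType) {
        List<String> queues = new ArrayList<>();
        if (globallyIndexedTypes.contains(assetType)) {
            queues.add("Global_Q");
        }
        if (typeIndexedTypes.contains(assetType)) {
            queues.add(assetType + "_Q"); // follows the "<asset type>_Q" naming pattern
        }
        return queues;
    }

    public static void main(String[] args) {
        QueueRoutingSketch routing =
                new QueueRoutingSketch(Set.of("Page", "Article"), Set.of("Article"));
        System.out.println(routing.queuesFor("Article")); // both queues
        System.out.println(routing.queuesFor("Page"));    // global queue only
        System.out.println(routing.queuesFor("Banner"));  // no queue at all
    }
}
```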
Albeit not officially supported, you can replace the OOTB indexing asset event listener with your own implementation. However, you are strongly advised not to do so unless you really (really really) know what you are doing.
We won't discuss here what would justify your implementing your own indexing asset event listener. One possible scenario is you've got a client that needs the OOTB indexing queues infrastructure replaced - for whatever reason - with, say, a JMS-driven one using MQSeries or similar.
The Indexing Daemon
Indexing queues are processed by means of WCS' built-in indexing daemon.
This indexing daemon is implemented as a SystemEvents entry / element: SearchIndexEvent.
This event triggers:
- Constantly: meaning it gets (re-)launched in the concerned environment as soon as the ongoing execution completes (or errors out),
- Uniquely: meaning it runs on a single node of the concerned environment (cluster) at any given time, and
- Locally: meaning each environment (e.g. MANAGEMENT, DELIVERY) has its own indexing daemon.
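The "constantly" and "uniquely" properties can be pictured with a small plain-Java model. This is a sketch under assumptions, not how WCS implements its daemon: the class and its members are hypothetical, and an in-process AtomicBoolean stands in for whatever cluster-wide coordination WCS actually uses. One instance of the class would correspond to one environment, which is the "locally" part.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of one environment's indexing daemon slot. */
final class DaemonScheduleSketch {
    /** Assumed stand-in for cluster-wide coordination: true while some node is running a pass. */
    private final AtomicBoolean running = new AtomicBoolean(false);
    /** Number of passes that finished, whether they succeeded or errored out. */
    final AtomicInteger completedRuns = new AtomicInteger();

    /** A node attempts a pass; returns false if another pass is already in progress. */
    boolean tryRun(Runnable indexingPass) {
        if (!running.compareAndSet(false, true)) {
            return false; // "uniquely": only one pass at any given time
        }
        try {
            indexingPass.run();
        } catch (RuntimeException e) {
            // "constantly": a pass that errors out must not prevent the next launch
        } finally {
            running.set(false);
            completedRuns.incrementAndGet();
        }
        return true;
    }

    public static void main(String[] args) {
        DaemonScheduleSketch daemon = new DaemonScheduleSketch();
        // An attempt made while a pass is in progress is rejected.
        daemon.tryRun(() ->
                System.out.println("nested attempt accepted: " + daemon.tryRun(() -> { })));
        // A failing pass releases the slot, so the following pass can start.
        daemon.tryRun(() -> { throw new IllegalStateException("simulated indexing failure"); });
        System.out.println("next pass accepted: " + daemon.tryRun(() -> { }));
    }
}
```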
If you ever need to alter the periodicity of the indexing daemon (or any aspect of that SystemEvents entry), you are strongly advised to do so by using either Sites Explorer (formerly known as "Content Server Explorer") or (UPDATE) SQL statements.
If you ever replace the OOTB indexing asset event listener with your own implementation, you may need to implement your own indexing daemon, too. That, of course, ultimately depends on what your custom implementation does.
The Lucene Index Files
Lucene index files are shared by all nodes of the same cluster (environment). They are closely related to the way WCS' OOTB indexing is implemented.
As you know, WCS' OOTB indexing mechanism relies on 2 types of Lucene index files: Global and Asset Type-specific.
You can find more info about these 2 types of index files here.
The important bit here is understanding that all of these OOTB Lucene index files are asset-oriented.
The main implication of the above usually hits novice WCS developers in the face when they are first required to implement site search on entire pages rather than at the asset level. Consider the following scenario:
- URL ".../Satellite?c=Page&cid=123&pagename=ArticlePageLayout&..." produces a webpage containing data that has been extracted not only from the main asset - Page:123 - but also from other assets in the system.
- Some of those "other assets" drive the production of widgets (for instance, via sub-pagelets) which don't represent entire pages on their own but just bits of information that can be embedded into other webpages.
- There are zero occurrences of the term "scientific" across all of Page:123's attribute values, but it does occur in 2 of the widgets that end up getting embedded inside the webpage produced by the URL above.
- If someone searches for "scientific" on your website, Page:123 doesn't show up in the results unless you implement some convoluted logic on top of your Lucene search in order to determine which matching widgets are embedded inside the webpage driven by Page:123.
Sure, there are ways to work around scenarios like the above, but all of them imply customizing the way WCS typically indexes assets.
For instance, for a case like the one described above, you could customize WCS' OOTB indexing so that, whenever a Page asset is indexed in Lucene, any widget asset bound to it also gets indexed as part of the concerned Page asset's document. This approach is generally applied to "Global" indexing, but you might find valid Asset Type-indexing use cases, too.
You could also customize what gets sent to the search engine index, where (in which index file) it gets stored or how data gets organized inside the index file, but that requires a deeper customization of WCS which is outside the scope of this article.
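The idea of folding widget data into the Page's own document can be illustrated with a minimal plain-Java sketch. It does not use WCS or Lucene APIs: the class, the method names and the sample data are all hypothetical, and a naive substring match stands in for a real Lucene query.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: one searchable document per page, including its embedded widgets. */
final class PageDocumentSketch {
    /** Concatenates the page's own text with the text of every widget bound to it. */
    static String buildIndexedText(String pageText, List<String> widgetIds,
                                   Map<String, String> widgetTextById) {
        StringJoiner text = new StringJoiner(" ");
        text.add(pageText);
        for (String widgetId : widgetIds) {
            String widgetText = widgetTextById.get(widgetId);
            if (widgetText != null) {
                text.add(widgetText);
            }
        }
        return text.toString();
    }

    /** Naive, case-insensitive term match used here instead of a real search engine query. */
    static boolean matches(String indexedText, String term) {
        return indexedText.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String pageText = "Annual report overview";
        Map<String, String> widgets = Map.of("Widget:7", "Latest scientific findings");
        String composite = buildIndexedText(pageText, List.of("Widget:7"), widgets);

        System.out.println(matches(pageText, "scientific"));  // page attributes alone
        System.out.println(matches(composite, "scientific")); // page plus embedded widgets
    }
}
```

With the composite text, a search for "scientific" can resolve directly to Page:123 without any post-processing of widget matches.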
WCS' (Lucene) Search API
WCS provides you with a Search API that abstracts you from the most basic Lucene-specific setup as well as some of Lucene's own search API objects.
The API is covered by both the official documentation and other articles around the web, so we won't delve into its fundamentals.
It is not unusual to find yourself in a situation where WCS' Search API falls short with regard to the kind of (Lucene) query you need to express.
In such cases, you have no alternative but to build your custom Lucene query on your own and use Lucene's own API (and perhaps a couple of utility methods in WCS' own API) for executing it.
WCS having its own object-driven query building API is, in our opinion, an unnecessary abstraction which creates an unnecessary overhead in terms of memory and CPU usage, especially when marshalling and "transforming" search engine-specific results into these (presumably) search engine-agnostic objects.
Also, WCS' API allegedly provides a mechanism for abstracting the actual search engine, too.
However, that mechanism is probably not the best architectural design pattern since it kind of implies there should be - at the very least - an understanding on the WCS side of the search engine's configuration settings.
More and more frequently, especially over the last 4-5 years, we are seeing clients in the field customize WCS so as to replace Lucene as WCS' public site search engine with a different product, typically SOLR or SOLR Cloud.
From an architectural standpoint, the above approach implies the search engine becomes a tier of its own, typically hosted on separate infrastructure (be it a server, a VM or a cloud, e.g. a SOLR Cloud) and sometimes controlled by a totally different organization / department / team.
Under such an approach, WCS becomes a mere consumer of the services exposed by the search engine tier through an API; typically, a Java-based, HTTP-driven (e.g. REST) one. For example: SolrJ.
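To make the "WCS as a consumer" idea concrete, here is a minimal sketch of how a site-side component might build a request against a Solr select handler. To stay self-contained it uses only the JDK rather than SolrJ, and it only builds the request URI without sending it. The host, port, collection name and class name are assumptions made for illustration, and the user's text is passed as the "q" parameter without any query-syntax escaping.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical client-side helper; a real integration would more likely rely on SolrJ. */
final class SearchTierClientSketch {
    /** Builds a URI for Solr's select handler; baseUrl and collection are assumed values. */
    static URI buildSelectUri(String baseUrl, String collection, String userQuery, int rows) {
        // URL-encode the user's text so spaces and special characters survive transport.
        String q = URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/" + collection + "/select?q=" + q
                + "&rows=" + rows + "&wt=json");
    }

    public static void main(String[] args) {
        URI uri = buildSelectUri("http://search.example.com:8983/solr", "site",
                "scientific research", 10);
        System.out.println(uri);
    }
}
```

The resulting URI could then be sent with any HTTP client, for instance java.net.http.HttpClient, and the JSON response mapped to whatever the site's templates expect.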
Cache Flushers play a crucial role in leveraging the above approach. We talked briefly about cache flushers in this other article; you are more than welcome to check that out. WCS' official documentation also mentions them here and there.
We will not dive into the details around such customization, but given the extensive experience we at Function1 have undertaking those, you will most probably see this explained in more detail in an upcoming article.
Lucene's Analyzers, Tokenizers and Filters
If you are committed to expressing your search queries using WCS' Search API and other built-in mechanisms, yet still need a way to adapt the way Lucene digests asset data to your needs, you must not forget about Lucene's own hook-up points: analyzers, tokenizers and filters.
These are the natural places where you should customize the way asset-specific data gets tokenized and, ultimately, stored in your (Lucene) index files.
With a custom field analyzer you can customize the way data from a given field gets examined, then transformed into a token stream.
An analyzer is a factory for analysis chains, i.e. an analyzer may be a single class or it may be composed of a series of tokenizer and filter classes chained together, where the output of one is the input of the next.
The resulting output of an analyzer is used to match query results or build indices.
Tokenizers are only responsible for breaking the input text into tokens. Multiple tokenizers could be applied to the same original input in order to produce different output, or to split up the production of the final token stream, production-line style.
Filters modify a stream of tokens and their contents. They examine a token stream and decide whether to keep, transform or discard each token, or even create new ones.
Analyzers are field-aware; tokenizers and filters are not. Analyzers may take a field name into account when constructing the token stream.
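The division of labour between the three kinds of components can be sketched in plain Java. This is a conceptual model only: it deliberately does not use Lucene's Analyzer, Tokenizer or TokenFilter classes, and the class name, field names and stop-word list are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Conceptual model of an analysis chain; not Lucene's actual API. */
final class AnalysisChainSketch {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    /** Tokenizer stand-in: only breaks the input text into tokens. */
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    /** Filter stand-in: transforms each token. */
    static List<String> lowercase(List<String> tokens) {
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Filter stand-in: discards some tokens. */
    static List<String> dropStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    /** Analyzer stand-in: field-aware, it picks the chain based on the field name. */
    static List<String> analyze(String fieldName, String text) {
        List<String> tokens = lowercase(tokenize(text));
        if ("body".equals(fieldName)) {
            tokens = dropStopWords(tokens); // only the hypothetical "body" field drops stop words
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Scientific Method, explained";
        System.out.println(analyze("title", text)); // [the, scientific, method, explained]
        System.out.println(analyze("body", text));  // [scientific, method, explained]
    }
}
```

The tokenizer and the filters never see the field name; only the analyzer does, which is what lets the same text produce different token streams per field.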
You can implement and plug your own Analyzer by:
- Making sure it implements the AnalyzerFactory interface, and
- Registering it by modifying the "properties" field of the appropriate row in the SearchEngineMetaDataConfig table as per the official documentation.
Based on the above, it should be obvious why these 3 kinds of (Lucene) components can be of great help in customizing the indexing strategy WCS ships with and/or empowering the kind of Lucene-specific search expressions that your application needs.
Custom analyzers / tokenizers / filters can be an extremely powerful tool when adequately combined with an effective asset model (incl. custom Asset Event Listeners, Flex Filters, PostUpdate elements and/or CacheUpdater) which ensures meaningful, complete data actually gets sent from WCS to Lucene whenever needed.