How to create a custom Elasticsearch analyzer that is insensitive to accents

Some Elasticsearch analyzers are sensitive to accents. For example, the Catalan analyzer returns no results for the query "diputacio", even if the indexed word is "diputació".

This can be problematic: in most languages, users commonly leave accents out of search queries, so accent-insensitive search is the expected behavior.
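The effect of accent folding can be illustrated in plain Python. This is only an approximation of what Elasticsearch's "asciifolding" filter does (the real filter covers many more Unicode cases):

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Approximate asciifolding: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("diputació"))  # -> diputacio
```

With both the indexed term and the query folded this way, "diputacio" matches "diputació".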

As a workaround, to avoid this behavior at the Elasticsearch level, you can add an "asciifolding" token filter to the out-of-the-box Elasticsearch analyzer.

Environment

  • Liferay DXP 7.0 - 7.3
  • Elasticsearch 2.x through 7.x

Creating Your Custom Analyzer

To create a custom analyzer based on an existing one, copy the original analyzer definition from the Elasticsearch documentation and apply the desired modifications.

For example, to redefine the catalan analyzer with "asciifolding" applied, use the following definition:

    "analysis": {
      "filter": {
        "catalan_elision": {
          "type":       "elision",
          "articles":   [ "d", "l", "m", "n", "s", "t"],
          "articles_case": true
        },
        "catalan_stop": {
          "type":       "stop",
          "stopwords":  "_catalan_" 
        },
        "catalan_stemmer": {
          "type":       "stemmer",
          "language":   "catalan"
        }
      },
      "analyzer": {
        "catalan": {
          "tokenizer":  "standard",
          "filter": [
            "catalan_elision",
            "lowercase",
            "asciifolding",
            "catalan_stop",
            "catalan_stemmer"
          ]
        }
      }
    }

Changes:

  1. "asciifolding" was added between "lowercase" and "catalan_stop" filters.
  2. "catalan_keywords" section was removed as we don't have any keywords to define.
  3. The analyzer keeps the name "catalan", the same as the existing out-of-the-box definition, so that it overrides the built-in analyzer instead of creating a new one with a different name.
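After the index is recreated with this configuration, you can verify the analyzer's output with Elasticsearch's _analyze API (available in this JSON-body form on Elasticsearch 6 and 7). A minimal sketch of building such a request in Python; the host, port, and index name "liferay-20101" are assumptions to adapt to your installation:

```python
import json
from urllib import request

def build_analyze_request(host: str, index: str, analyzer: str, text: str):
    """Build an Elasticsearch _analyze request for a given index and analyzer."""
    url = f"{host}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = build_analyze_request("http://localhost:9200", "liferay-20101",
                            "catalan", "diputació")
# To send it against a running cluster (not executed here):
# with request.urlopen(req) as resp:
#     print(json.load(resp))  # the tokens should include the folded "diputacio"
```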

Applying your Custom Analyzer

If you want to apply your custom analyzer to a Liferay index, you must configure it on the Liferay side, because an analyzer can only be set at index creation time. It is not possible to redefine an analyzer on an existing index.

To apply your custom analyzer in a Liferay installation, follow these steps:

  1. Navigate to Control Panel → Configuration → System Settings.
  2. Find the Elasticsearch 6 (or Elasticsearch 7) entry (scroll down to it or use the search box), click the Actions icon, then Edit.
  3. Go to "Additional Index Configurations" and paste the JSON fragment with your custom analyzer there.
  • Note: If this field already contains configuration, you must merge both JSON configurations.
  4. Click Save.
  5. To apply the new analyzer, execute a full reindex: navigate to Control Panel → Configuration → Search (DXP 7.1 and 7.2; under the Index Actions tab in DXP 7.3) or Control Panel → Configuration → Server Administration (DXP 7.0), and click Execute next to Reindex all search indexes.
  • Important: A full reindex can take a long time, can be very CPU-intensive, and affects all search functionality while it runs. Consider executing it during a low-activity period (e.g., at night).
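When the "Additional Index Configurations" field already has a value, the new analyzer has to be merged into the existing JSON key by key rather than pasted alongside it. A minimal recursive merge sketch in Python; the fragment contents are illustrative:

```python
def merge_analysis(existing: dict, new: dict) -> dict:
    """Recursively merge two analysis configuration fragments.

    Nested dicts are merged key by key; on a conflict, values from `new` win.
    """
    merged = dict(existing)
    for key, value in new.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_analysis(merged[key], value)
        else:
            merged[key] = value
    return merged

existing = {"analysis": {"analyzer": {
    "default": {"tokenizer": "standard", "filter": ["lowercase"]}}}}
new = {"analysis": {"analyzer": {
    "catalan": {"tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"]}}}}

combined = merge_analysis(existing, new)
# combined["analysis"]["analyzer"] now contains both "default" and "catalan"
```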

Once the Liferay indexes are regenerated, they will use the custom analyzer. In our example, the "catalan" analyzer overrides the out-of-the-box catalan analyzer. This configuration only applies to Liferay indexes, so it won't affect any third-party indexes stored in the same Elasticsearch cluster.

Non-localized fields

Fields that are not localized (their names don't end in a language ID) have no analyzer defined at the field level, so they are analyzed with the standard analyzer as a fallback.

If you want to make them accent-insensitive, you have two options:

Option 1: Define a "default" analyzer in the index

You can set a "default" analyzer in the index configuration. Elasticsearch uses this default analyzer instead of the standard one.

To do this, define a new analyzer called "default" that is a copy of the standard analyzer with the asciifolding filter added:

        "analysis": {
          "analyzer": {
            "default": {
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }

Important note: All fields without a specific analyzer will be analyzed with this new default one, which can cause problems if any field in the index must remain sensitive to accents.

Option 2: Apply a custom analyzer in the type mappings

Instead of defining a default index analyzer, you can create a custom version of the standard analyzer with the asciifolding filter and configure it on the specific fields you want to make accent-insensitive.

You will need to apply the analyzer to each such field in the "Additional Type Mappings" section of the Liferay Elasticsearch Connector configuration.

For example, you can apply a new "custom_standard_analyzer" to the "lastName" field by adding this configuration to the "Additional Type Mappings" field:

    {
        "LiferayDocumentType": {
            "properties": {
                "lastName": {
                    "analyzer": "custom_standard_analyzer",
                    "store": "true",
                    "type": "text"
                }
            }
        }
    }

On the one hand, this gives you finer control over which non-localized fields are accent-insensitive; on the other hand, you have to configure each field manually, so it is more time-consuming.
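Before pasting a fragment like the one above into "Additional Type Mappings", it can save a reindex cycle to sanity-check that it is valid JSON and actually references the analyzer you defined. A small, illustrative check in Python:

```python
import json

MAPPING_FRAGMENT = """
{
    "LiferayDocumentType": {
        "properties": {
            "lastName": {
                "analyzer": "custom_standard_analyzer",
                "store": "true",
                "type": "text"
            }
        }
    }
}
"""

def referenced_analyzers(mapping_json: str) -> set:
    """Collect every analyzer name referenced in a type-mapping fragment."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "analyzer":
                    found.add(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(mapping_json))  # raises ValueError on malformed JSON
    return found

print(referenced_analyzers(MAPPING_FRAGMENT))  # -> {'custom_standard_analyzer'}
```

Every name this returns should match an analyzer defined in your "Additional Index Configurations" JSON.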

Defining multiple analyzers

If you want to define several analyzers, merge the JSON configuration of each one, leaving a single "filter" section and a single "analyzer" section.

For example, to redefine both "catalan" and "spanish" with "asciifolding" applied, configure the following:

    "analysis": {
        "filter": {
            "catalan_elision": {
                "type": "elision",
                 "articles":   [ "d", "l", "m", "n", "s", "t"],
                "articles_case": true
            },
            "catalan_stop": {
                "type": "stop",
                "stopwords": "_catalan_"
            },
            "catalan_stemmer": {
                "type": "stemmer",
                "language": "catalan"
            },
            "spanish_stop": {
                "type": "stop",
                "stopwords": "_spanish_"
            },
            "spanish_stemmer": {
                "type": "stemmer",
                "language": "light_spanish"
            }
        },
        "analyzer": {
            "catalan": {
                "tokenizer": "standard",
                "filter": [
                    "catalan_elision",
                    "lowercase",
                    "asciifolding",
                    "catalan_stop",
                    "catalan_stemmer"
                ]
            },
            "spanish": {
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "spanish_stop",
                    "spanish_stemmer"
                ]
            }
        }
    }