How to create a custom Elasticsearch analyzer that is insensitive to accents

Some Elasticsearch analyzers are sensitive to accents. For example, the Catalan analyzer returns no results for the query "diputacio", even if the indexed word is "diputació".

This can be problematic: in most languages, users commonly leave accents out of their search queries, so accent-insensitive search is the expected behavior.

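As a workaround at the Elasticsearch level, you can add the "asciifolding" token filter to the out-of-the-box analyzer definition. This filter converts accented characters to their closest ASCII equivalents (for example, "ó" becomes "o"), so the accented and unaccented forms of a word produce the same token.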
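To make the effect of accent folding concrete, here is a minimal Python sketch. It is only an approximation: it strips combining marks after Unicode decomposition, whereas the real "asciifolding" filter uses its own mapping table and covers more characters (for example, "ß" or "ø"). The function name fold_accents is hypothetical and is not part of Elasticsearch or Liferay.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose characters, then drop combining marks.

    This is NOT the Lucene asciifolding implementation, only an illustration.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# After folding, the indexed term and the query term reduce to the same string,
# which is why the unaccented query can match the accented document.
indexed_term = fold_accents("diputació".lower())
query_term = fold_accents("diputacio".lower())
print(indexed_term == query_term)
```

Because the same analyzer normally runs at index time and at search time, both sides are folded in the same way.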

Environment

  • Liferay DXP 7.0, 7.1, and 7.2
  • Elasticsearch 2.x through 7.x

Creating Your Custom Analyzer

To create a custom analyzer based on an existing one, copy the original definition from the Elasticsearch documentation and apply the desired modification.

For example, to redefine the catalan analyzer with "asciifolding" applied, use the following definition:

    {
      "analysis": {
        "filter": {
          "catalan_elision": {
            "type":       "elision",
            "articles":   [ "d", "l", "m", "n", "s", "t" ],
            "articles_case": true
          },
          "catalan_stop": {
            "type":       "stop",
            "stopwords":  "_catalan_"
          },
          "catalan_stemmer": {
            "type":       "stemmer",
            "language":   "catalan"
          }
        },
        "analyzer": {
          "catalan": {
            "tokenizer":  "standard",
            "filter": [
              "catalan_elision",
              "lowercase",
              "asciifolding",
              "catalan_stop",
              "catalan_stemmer"
            ]
          }
        }
      }
    }

Changes with respect to the original definition:

  1. "asciifolding" was added between the "lowercase" and "catalan_stop" filters.
  2. The "catalan_keywords" filter was removed, as there are no keywords to exclude from stemming.
  3. The analyzer is named "catalan", the same name as the out-of-the-box definition, because the goal is to overwrite it rather than to create a new analyzer with a different name.
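The first change in the list above can also be expressed programmatically. The following Python sketch takes an analyzer definition as a dictionary and returns a copy with "asciifolding" placed right after "lowercase". The helper name with_asciifolding and the fallback position used when there is no "lowercase" filter are assumptions made for illustration only.

```python
import copy
import json


def with_asciifolding(analyzer: dict) -> dict:
    """Return a copy of an analyzer definition with "asciifolding" after "lowercase"."""
    result = copy.deepcopy(analyzer)
    filters = result.setdefault("filter", [])
    if "asciifolding" in filters:
        return result  # nothing to do, the chain already folds accents
    if "lowercase" in filters:
        position = filters.index("lowercase") + 1
    else:
        position = 0  # assumption: without "lowercase", fold first
    filters.insert(position, "asciifolding")
    return result


# Filter chain of the stock catalan analyzer, without the keyword marker.
stock_catalan = {
    "tokenizer": "standard",
    "filter": ["catalan_elision", "lowercase", "catalan_stop", "catalan_stemmer"],
}

print(json.dumps({"catalan": with_asciifolding(stock_catalan)}, indent=2))
```

The input dictionary is left untouched, so the same helper can be applied to several language analyzers in turn.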

Applying Your Custom Analyzer

To apply your custom analyzer to a Liferay index, you must configure it on the Liferay side, because an analyzer can only be defined when the index is created. It is not possible to redefine an analyzer for an existing index.

To apply your custom analyzer in a Liferay installation, follow these steps:

  1. Navigate to Control Panel → Configuration → System Settings.
  2. Find the Elasticsearch 6 (or Elasticsearch 7) entry (scroll down to it or use the search box), click the Actions icon, then click Edit.
  3. Go to "Additional Index Configurations" and paste the JSON fragment with your custom analyzer there.
     Note: If this field already contains a configuration, you must merge both JSON configurations.
  4. Click Save.
  5. To apply the new analyzer, execute a full reindex: navigate to Control Panel → Configuration → Search (DXP 7.2 and 7.1) or Control Panel → Configuration → Server Administration (DXP 7.0), and click Execute next to "Reindex all search indexes".
     Important: A full reindex can take a long time and be very CPU intensive, and all search functionality can be affected while it runs. Consider executing it when your system has low activity (e.g., at night).

Once the Liferay indexes are regenerated, they will use the custom analyzer. In our example, the new "catalan" analyzer overwrites the out-of-the-box catalan analyzer. This configuration only applies to Liferay indexes, so it does not affect any third-party index stored in the same Elasticsearch cluster.
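One way to check the result after the reindex is the Elasticsearch _analyze API, which returns the tokens an analyzer produces for a given text. The Python sketch below builds such a request with the standard library only. The base URL http://localhost:9200 and the index name liferay-20101 are hypothetical placeholders, and the helper names are made up for this example; replace them with the values of your own installation.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build a POST request for <base_url>/<index>/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyzed_tokens(request):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Hypothetical host and index name, for illustration only.
request = build_analyze_request(
    "http://localhost:9200", "liferay-20101", "catalan", "diputació"
)
print(request.full_url)
# tokens = analyzed_tokens(request)  # requires a reachable Elasticsearch node
```

If the custom analyzer is in place, calling analyzed_tokens for "diputació" and for "diputacio" should return the same list of tokens.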

Additional Information

If you want an accent-insensitive analyzer on a field that is not localized (that is, a field whose name does not end in a language ID), you need to assign the analyzer to that field in the "Additional Type Mappings" section of the Liferay Elasticsearch connector configuration.

For example, you can apply a new analyzer named "custom_analyzer" to the "lastName" field by adding this configuration to the "Additional Type Mappings" field:

    { 
        "LiferayDocumentType": {  
            "properties": {   
                "lastName": {
                    "analyzer": "custom_analyzer",
                    "store": "true",
                    "type": "text"
                }
            }   
        }
    }
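The mapping above refers to an analyzer named "custom_analyzer", which also has to be defined in "Additional Index Configurations". As a sketch under assumptions, the Python helper below generates both JSON documents for a list of field names. The analyzer definition used here (standard tokenizer with "lowercase" and "asciifolding") is an assumed minimal language-neutral choice, and the helper name accent_insensitive_config is hypothetical.

```python
import json


def accent_insensitive_config(analyzer_name, field_names):
    """Return (index configuration, type mappings) as dictionaries."""
    index_config = {
        "analysis": {
            "analyzer": {
                analyzer_name: {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Assumed minimal chain: no stop words, no stemming.
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
    type_mappings = {
        "LiferayDocumentType": {
            "properties": {
                name: {"analyzer": analyzer_name, "store": "true", "type": "text"}
                for name in field_names
            }
        }
    }
    return index_config, type_mappings


index_config, type_mappings = accent_insensitive_config("custom_analyzer", ["lastName"])
print(json.dumps(index_config, indent=4))   # for "Additional Index Configurations"
print(json.dumps(type_mappings, indent=4))  # for "Additional Type Mappings"
```

Generating both documents from the same analyzer name avoids a mapping that points to an analyzer that was never defined.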

If you want to define several analyzers, you must merge the JSON configuration of each one so that there is only one "filter" section and one "analyzer" section.

For example, to redefine both "catalan" and "spanish" with "asciifolding" applied, use the following configuration:

    {
        "analysis": {
            "filter": {
                "catalan_elision": {
                    "type": "elision",
                    "articles": [ "d", "l", "m", "n", "s", "t" ],
                    "articles_case": true
                },
                "catalan_stop": {
                    "type": "stop",
                    "stopwords": "_catalan_"
                },
                "catalan_stemmer": {
                    "type": "stemmer",
                    "language": "catalan"
                },
                "spanish_stop": {
                    "type": "stop",
                    "stopwords": "_spanish_"
                },
                "spanish_stemmer": {
                    "type": "stemmer",
                    "language": "light_spanish"
                }
            },
            "analyzer": {
                "catalan": {
                    "tokenizer": "standard",
                    "filter": [
                        "catalan_elision",
                        "lowercase",
                        "asciifolding",
                        "catalan_stop",
                        "catalan_stemmer"
                    ]
                },
                "spanish": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "spanish_stop",
                        "spanish_stemmer"
                    ]
                }
            }
        }
    }
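Merging several analyzer configurations by hand is error-prone, so it can help to do it with a small script. The following Python sketch performs a recursive merge of two configuration dictionaries, which yields a single "filter" section and a single "analyzer" section. The function name merge_config is hypothetical, and the rule that the second configuration wins when both define the same key is an assumption of this sketch, not Liferay behavior.

```python
import json


def merge_config(base: dict, extra: dict) -> dict:
    """Recursively merge two configurations; on conflicts, `extra` wins."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shortened definitions, for illustration only.
catalan = {
    "analysis": {
        "filter": {"catalan_stop": {"type": "stop", "stopwords": "_catalan_"}},
        "analyzer": {
            "catalan": {
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "catalan_stop"],
            }
        },
    }
}
spanish = {
    "analysis": {
        "filter": {"spanish_stop": {"type": "stop", "stopwords": "_spanish_"}},
        "analyzer": {
            "spanish": {
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "spanish_stop"],
            }
        },
    }
}

merged = merge_config(catalan, spanish)
print(json.dumps(merged, indent=4))
```

The same function can be used to merge a new analyzer into a configuration that already exists in the "Additional Index Configurations" field.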