Zhu Wu's Blog

The world is a fine place and worth fighting for.

An ElasticSearch-based Address Autocompletion Solution

It is common that users need to key in their addresses on e-commerce or public service applications. Providing address autocompletion functionality could greatly improve user experience and reduce typo errors. Here, I will provide an address autocompletion solution implemented upon ElasticSearch.

Addresses are usually maintained by the relative authority of the state and they have a standard format. For example, the address in Australia is defined by the following format:

1               Pitt St       Cowra    NSW     2794
(Street number) (Street name) (Suburb) (State) (Postcode)

However, the user input may not be able to fully conform to the standard format (e.g., "Pitt St 1 NSW Cowra 2794"), or it misses certain part (e.g., "Pitt St 1 NSW 2794").

A naive solution is to use a match query to match user input against the indices. However, this query could be slow, because certain words are too common in addresses, e.g., the number "1", or the word "street". Thus, ElasticSearch needs to scan through a lot of indexed addresses when such common words appear in the query and the query could be very slow.

Anyway, the user input should still partially follow the standard format, otherwise it could not represent a meaningful address any more. Thus, we can exploit the partially order to make a more efficient query. The user input are extracted into bigrams. Each bigram form a span near query of two terms with a meaningfully large slop. The address needs to contain both words specified in the span near query and the position difference of the two words need to be smaller than the slop. Thus, much less addresses need to be scanned and it could greatly improve performance. Also, when user input has missing part or partially out of order, the order information between words in user input still presents in the query and helps to find the target result.

Abbreviation are usually allowed in addresses. For example, "St" stands for "Street" and "NSW" stands for "New South Wales" in the example address above. The solution should be able to return the target address regardless of user input is in abbreviation or full form. We can use two fields to store the full form and abbreviation separately, and we need to query against these two fields. Alternatively, we can concatenate full form and abbreviation and store it in one field, and then only the concatenated field needs to be queried.

In a typical autocompletion implementation, a list of addresses are suggested when user is typing (it is also referred to as search as you type). For example, when user types "35 Stir", "35 Stirling Rd" may be suggested. To achieve that, we can use edge ngram index to index the addresses. However, in some cases, a word in an address may be a suffix of a word in another address (e.g., in "1 Pitt St" and "1 Pittsford St", "Pitt" is the suffix of "Pittsford"). When user types "1 Pitt", both "1 Pitt St" and "1 Pittsford St" may be suggested and "1 Pitt St" should appear before "1 Pittsford St". Of course, when user types "1 Pitts", "1 Pittsford St" is suggested and "1 Pitt St" is no longer suggested. Therefore, we need another term index to index the addresses. Upon user input, both edge ngram index and term index are queried. When the user input is "35 Stir", it matches the edge ngram index of "35 Stirling Rd", so "35 Stirling Rd" is suggested. When the user input is "1 Pitt", it matches edge ngram index and term index of "1 Pitt St", but it only matches edge ngram index of "1 Pittsford St". Since "1 Pitt St" has a larger matching score than "1 Pittsford" in this case, "1 Pitt St" will be suggested first, followed by "1 Pittsford St".

Let's put all the details discussed above together.
When creating indices:
- The address in full form and the abbreviated address concatenate together.
- The concatenated address is indexed into two fields. One is indexed by an edge ngram analyzer, and the other is indexed by a term analyzer.
When querying:
- The input is broken into bigrams.
- Each bigram forms two span near queries. Each span near query query against the edge ngram index field and term index field respectively.
- All the span near queries are joined by or condition.