Friday, April 26, 2013

Generating recommendations via Elasticsearch - Part 1

Let's look at what it entails to roll your own recommendation engine with Elasticsearch.

A simple use case: Given a product being browsed, we want to provide suggestions based on what others have purchased in the past.

To accomplish this, we break the work down into two steps:

  1. Index the past sale data, invoices or receipts in Elasticsearch.
  2. Run a query that narrows the results down to the product a consumer is browsing, then returns suggestions for the other items most frequently purchased together with it.
    Photo 1
  • Here's an example of a sale. But what is the correct structure for the JSON document we index into Elasticsearch, so that we can run the type of query that yields the results for our use case?

    One optimistic way to do this is to simply store the receipt itself as a JSON document:

    But that just won't work because there isn't any query (known to me at least) that can make heads or tails of that one document alone and deliver aggregated results of items most frequently bought together.
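To make the receipt-as-document idea concrete, here is a minimal sketch in Python. The field names (`sale_id`, `lineitems`, `item`, `quantity`) and the items themselves are assumptions for illustration, not the data from the original sale.

```python
import json

# Hypothetical receipt: field names and items are assumptions,
# not the data from the original sale.
receipt = {
    "sale_id": 1,
    "lineitems": [
        {"item": "Apples", "quantity": 2},
        {"item": "Oranges", "quantity": 1},
        {"item": "Lemons", "quantity": 3},
    ],
}

# The whole receipt is serialized as one JSON document to index.
document = json.dumps(receipt, indent=2)
print(document)
```

Everything about the sale lives inside this single document, which is why a query that counts across documents has nothing to aggregate over.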

  • Photo 2
  • We could generate one document for each item a query might come in for, then add the lineitems from various sales to it as a multi-value field. That field is what we would facet upon to deliver aggregated results of items most frequently bought together.

    Let's keep our focus on Apples; the rest of the sibling lineitems become recommendations for someone who comes looking for Apples:

    Any additional sales add more elements to the recorded lineitems:

    But this doesn't work because the facet query will report one Orange, one Lemon, etc. Even if Oranges appears twice in the lineitems field, it is counted as 1, not 2. This is because the facet query counts the number of documents that contain the term "Oranges", not the number of times the term appears within a document or field.
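The counting behavior described above can be imitated in plain Python. This is a rough simulation, not Elasticsearch itself: `terms_facet` is a hypothetical helper that mimics document-level counting, and the contents of the Apples document are made up.

```python
from collections import Counter

# Hypothetical per-item document: one document for Apples, with the sibling
# lineitems of two made-up sales appended to a multi-value field.
apples_doc = {
    "item": "Apples",
    "lineitems": ["Oranges", "Lemons", "Oranges", "Bananas"],
}


def terms_facet(docs, field):
    """Count, per term, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        # set() collapses repeats: a term counts once per document.
        counts.update(set(doc[field]))
    return counts


facet = terms_facet([apples_doc], "lineitems")
print(facet["Oranges"])  # 1, even though Oranges appears twice in the field
```

With only one document per item, every sibling term can score at most 1, so the ranking carries no information.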

  • Photo 3
  • What next?

    We can break apart the receipt into (total # of lineitems) * (total # of lineitems - 1) documents, one per ordered pair of lineitems. For example, a receipt with 3 lineitems yields 3 * 2 = 6 documents. This way, when a new sale appears, matching pairs can be identified and the facet query can count them accurately because each pair is its own document.
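The pair breakdown above can be sketched in a few lines of Python. The helper `pair_documents` and the field names `item` and `bought_with` are hypothetical, and the sketch assumes each item appears at most once per receipt.

```python
from itertools import permutations


def pair_documents(sale_id, lineitems):
    """Return one document per ordered pair of distinct lineitems."""
    return [
        {"sale_id": sale_id, "item": a, "bought_with": b}
        for a, b in permutations(lineitems, 2)
    ]


# Hypothetical sale with three lineitems.
sale1 = ["Apples", "Oranges", "Lemons"]
docs = pair_documents(1, sale1)
print(len(docs))  # 6, i.e. 3 * (3 - 1)
```

Ordered pairs are used on purpose: both (Apples, Oranges) and (Oranges, Apples) are stored, so a lookup on either item finds the other.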

  • Photo 4
  • Now we can see that the documents (one lineitem pair per document) from Sale1 and Sale2 can finally be used by a facet query to suggest that Apples are most often bought together with Oranges, and vice versa.
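As a rough check of this result, the following Python sketch counts pair documents the way a facet would, one count per matching document. The two sales, the field names, and the `recommend` helper are all assumptions for illustration; in Elasticsearch the counting would be done by the facet query rather than by application code.

```python
from collections import Counter
from itertools import permutations

# Two hypothetical sales; the actual items in Sale1 and Sale2 may differ.
sales = {
    1: ["Apples", "Oranges", "Lemons"],
    2: ["Apples", "Oranges", "Bananas"],
}

# One document per ordered pair of lineitems, per sale.
pair_docs = [
    {"sale_id": sale_id, "item": a, "bought_with": b}
    for sale_id, items in sales.items()
    for a, b in permutations(items, 2)
]


def recommend(item, docs):
    """Narrow to the browsed item, then count documents per companion item."""
    counts = Counter(d["bought_with"] for d in docs if d["item"] == item)
    return counts.most_common()


print(recommend("Apples", pair_docs)[0])   # ('Oranges', 2)
print(recommend("Oranges", pair_docs)[0])  # ('Apples', 2)
```

Because every pair is a separate document, a second sale containing the same pair raises the count to 2 instead of being collapsed into 1.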

1 comment:

  1. Couldn't you just get the aggregation of the line items where line items contains the current product (obviously minus the current item)? I don't understand why you need to build another data structure.