"Prediction is Very Hard, Especially About Conversion"

Posted on March 6, 2020

I recently came across the paper “Prediction is very hard, especially about conversion”, which describes a machine learning project similar to one I did for Haymarket Automotive, so I thought it would be interesting to do a quick review of the paper and compare it with my own experiences.

Knowing if a user is a “buyer” vs “window shopper” solely based on clickstream data is of crucial importance for e-commerce platforms seeking to implement real-time accurate NBA (“next best action”) policies. However, due to the low frequency of conversion events and the noisiness of browsing data, classifying user sessions is very challenging.

My task was similar (identify users who are ready to buy a car vs. those who are just browsing the site for entertainment). These are unbalanced classes - there are far more people just browsing the site - but not to the same extent as for an ecommerce store, where less than 10% of people convert.

They simplify the clickstream data by converting every hit into one of “view”, “click”, “detail” (product detail page), “add to cart”, “remove from cart” and “buy”. This is a good summary of everything a user can do on a fashion ecommerce site, but it means their algorithms can’t use any metadata about the products or pages the user is interacting with. I suspect that when they do this “for real” the metadata is used, and that it is only excluded here so that the results in the paper can be replicated - if the metadata were necessary, they would have to share confidential client information with anyone who wanted to reproduce their results.

Imagine a user who visits the homepage and bounces straight off the site, never to be seen again - it is very difficult to say anything about the intent of this person, except that if they were intending to buy when they first loaded the page then they changed their mind fairly quickly. Because of this the authors exclude all sessions with fewer than 10 hits from the data. That represents more than half of all the sessions - the 75th percentile of session length is 12, so it might be more than 70% of sessions that are discarded!

All sites will have the same problem to a greater or lesser extent. In my work with Haymarket we chose not to exclude any sessions, but our algorithms used more detailed metadata about the page (or pages) a user was looking at, which made a better-than-random prediction possible even for these users.

Still, it is important to keep in mind that the authors have selected an easier version of the problem by only looking at the fraction of sessions where they have lots of data.

Baseline Model

One of the reasons I like this paper is that they start with a very simple model and then build up from there to more complicated models.

The model they start with is naive Bayes on event n-grams.

First they convert the event stream into 5-grams (I’ll explain why 5 is chosen later). So if you have a session that looks like “VIEW VIEW CLICK DETAIL CLICK DETAIL ADD VIEW VIEW BUY” then the 5-grams are every contiguous subsequence of length 5.

The 5-grams for a session are then treated “bag of words” style, which means the order and frequency of the 5-grams within the session are ignored - all that matters is whether or not a particular 5-gram appeared in the session.

A naive Bayes classifier is then used on the 5-grams to make the prediction. They use a cutoff probability of 50%, so the prediction is BUY if the probability is >50% and NOBUY otherwise. They don’t present a ROC curve or anything like that, so readers don’t learn anything about specificity/sensitivity.
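
To make the baseline concrete, here’s a rough sketch of how I picture it in Python. The paper doesn’t give implementation details, so the second (NOBUY) session, the preprocessing and the choice of scikit-learn’s BernoulliNB are my assumptions, not theirs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Two toy sessions: the BUY example from above plus a made-up NOBUY one.
sessions = [
    "VIEW VIEW CLICK DETAIL CLICK DETAIL ADD VIEW VIEW BUY",
    "VIEW CLICK VIEW DETAIL VIEW VIEW CLICK DETAIL VIEW VIEW",
]
labels = [1, 0]  # 1 = BUY, 0 = NOBUY

# Event-level 5-grams; binary=True gives the "did this 5-gram appear at all?"
# treatment (order and frequency within a session are ignored).
vectoriser = CountVectorizer(
    ngram_range=(5, 5), binary=True, lowercase=False, token_pattern=r"\S+"
)
X = vectoriser.fit_transform(sessions)

model = BernoulliNB()
model.fit(X, labels)

# Predict BUY when the estimated probability is above the 50% cutoff.
probs = model.predict_proba(X)[:, 1]
predictions = ["BUY" if p > 0.5 else "NOBUY" for p in probs]
```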

The accuracy for this method is 82.1%.

5-grams are used because they tested n-grams for n=1,2,3,4,5 and 5-grams performed the best. I guess they stopped at 5 because they were worried about overfitting.

Markov Chains

The second method they test is training two Markov chains: one on the BUY sessions and one on the NOBUY sessions.

To make a prediction they compute how likely it is that the session was generated by each Markov chain; whichever gives the higher probability is the prediction.

By default a Markov chain only looks at the step immediately preceding the current one (this is the “Markov property”), but they use a higher-order chain that looks at the five previous steps - similar to the 5-grams above. Again they picked 5 after some testing, but they don’t say which values were tried for the order of the chain.
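
Here’s roughly how I picture the two-chain approach. The paper doesn’t describe the estimation details, so the add-one smoothing and the general structure below are my guesses rather than their implementation.

```python
from collections import defaultdict
import math

N_EVENTS = 6  # view, click, detail, add, remove, buy

def train_chain(sessions, k=5):
    """Count how often each event follows each length-k context."""
    counts = defaultdict(lambda: defaultdict(int))
    for events in sessions:
        for i in range(k, len(events)):
            counts[tuple(events[i - k:i])][events[i]] += 1
    return counts

def log_likelihood(events, counts, k=5, alpha=1.0):
    """Smoothed log-probability of a session under one chain."""
    ll = 0.0
    for i in range(k, len(events)):
        context, nxt = tuple(events[i - k:i]), events[i]
        total = sum(counts[context].values())
        ll += math.log((counts[context][nxt] + alpha) / (total + alpha * N_EVENTS))
    return ll

def predict(events, buy_chain, nobuy_chain, k=5):
    """BUY if the BUY chain is more likely to have generated the session."""
    return ("BUY" if log_likelihood(events, buy_chain, k)
            > log_likelihood(events, nobuy_chain, k) else "NOBUY")
```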

The accuracy of this method is 88% - quite a big improvement over naive Bayes.

LSTM

LSTM is one of my favourite acronyms; it stands for “Long Short-Term Memory”!

It is a type of recurrent neural network for generating and predicting sequences, and it is much more sophisticated than Markov chains for this kind of task.

I’m not clear on how they use the LSTM network to make a prediction for this model. What I thought was the normal way of doing this comes later in the paper, so I’m not quite sure what they do at this stage. I should read the cited papers to find out, I suppose.

My best guess is that they make predictions in the same way as for the Markov chain models: by training two networks and then seeing which one (the BUY or the NOBUY network) was the more likely to have generated the session.
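
If that guess is right, each class gets its own next-event model and a new session is scored under both. The sketch below shows what that scoring could look like in PyTorch - the architecture, the layer sizes and the (omitted) training loop are all placeholders of mine, not anything taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEventLSTM(nn.Module):
    """A small next-event language model over the six event types."""
    def __init__(self, n_events=6, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_events, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_events)

    def forward(self, x):              # x: (batch, seq_len) of event ids
        h, _ = self.lstm(self.embed(x))
        return self.out(h)             # logits for the next event at each step

def session_log_likelihood(model, events):
    """Sum of log P(event_t | earlier events) under one trained model."""
    x = torch.tensor([events])                      # shape (1, seq_len)
    logits = model(x)[:, :-1, :]                    # predictions for steps 1..T-1
    targets = x[:, 1:]                              # the events that actually came next
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

# Prediction rule: whichever class's model assigns the higher likelihood wins.
# pred = "BUY" if session_log_likelihood(buy_model, s) > session_log_likelihood(nobuy_model, s) else "NOBUY"
```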

As with the previous methods, there are parameters that are tuned by testing the accuracy of the model on the test set. Naive Bayes and Markov chains only have one parameter to tune (the “n” in n-grams or the order of the chain), but a neural net has a lot more, which can be very time consuming - in terms of compute time - to search over.

we did not […] explore the effect of dropout

Adding dropout layers worked quite well for me - it meant I could train the model for more epochs without overfitting which helped boost accuracy.
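
For what it’s worth, this is roughly what I mean by adding dropout: dropout between stacked LSTM layers and again on the pooled output before the classification layer. This is a PyTorch illustration of the idea, not my actual Haymarket code, and the sizes and rates are placeholders.

```python
import torch.nn as nn

# Dropout between the stacked LSTM layers...
lstm = nn.LSTM(input_size=16, hidden_size=64, num_layers=2,
               batch_first=True, dropout=0.3)

# ...and again on the (pooled) LSTM output before the classification layer.
head = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(64, 1),
    nn.Sigmoid(),  # probability of BUY
)
```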

The accuracy reported for this method is 90.9%.

LSTM (but more complicated)

This is much more similar to my efforts with LSTM.

Rather than using the LSTM as a generative network and then calculating the probability of a particular sequence being generated that way, they make the prediction part of the output of the network itself. This means that all the fancy gradient descent and neural network optimisation magic can be applied directly to improving the predictions rather than just improving the generative process.

Two pooling strategies were explored, changing the information that is used to classify sequences: taking the output of the LSTM at the last time step (ignoring padding indices) and taking the average LSTM output over the entire sequence (again ignoring padding indices).

The pooled output of the LSTM was then passed through the fully connected layer and transformed using the sigmoid activation function, which was then taken as the prediction given the sequence

So they use two methods of linking the LSTM with the final bit of the model that makes the prediction:

  1. Average pooling: take the mean of the LSTM output across the whole sequence
  2. “Last” pooling: take the LSTM output at the final time step - this is kind of like saying that the actions that take place closer to the BUY action are more predictive.

This is totally new to me - I had not thought of this way of looking at things.
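
Here’s how I would sketch the two pooling options in PyTorch. Everything below (layer sizes, the padding convention, the masking) is my reading of the description, not the authors’ code.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, n_events=6, embed_dim=16, hidden_dim=64, pooling="last"):
        super().__init__()
        self.embed = nn.Embedding(n_events + 1, embed_dim, padding_idx=0)  # 0 = padding
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        self.pooling = pooling

    def forward(self, x, lengths):
        # x: (batch, max_len) of event ids padded with 0; lengths: true lengths (LongTensor)
        out, _ = self.lstm(self.embed(x))            # (batch, max_len, hidden_dim)
        if self.pooling == "last":
            # hidden state at each session's final real (non-padding) step
            idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(-1))
            pooled = out.gather(1, idx).squeeze(1)
        else:
            # mean of the hidden states over the real (non-padding) steps only
            mask = (x != 0).unsqueeze(-1).float()
            pooled = (out * mask).sum(dim=1) / lengths.view(-1, 1).float()
        # fully connected layer + sigmoid gives P(BUY | session)
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)
```

The pooling="last" and pooling="mean" branches correspond to the paper’s “last” and average pooling strategies respectively.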

The accuracy of the two methods is:

  1. Average pooling: 92.7%
  2. “Last” pooling: 93.2%

Further Thoughts

These thoughts relate specifically to the project I was working on.