ML Implementation – lessons from Booking.com

I came across an interesting paper published by data scientists at Booking.com, courtesy of the podcast Linear Digressions (@LinDigressions). The paper summarizes 6 business lessons learnt by the data scientists at Booking.com after deploying 150 Machine Learning (ML) models. These observations are insightful and offer a practitioner's view of implementing ML models for business purposes; some are non-intuitive and raise questions or grounds for further research:

  1. Use Classes of Models: Some Machine Learning (ML) models at Booking.com are created for a very specific use case, while others are broader-based models that are consumed by the specific-use-case ML models. This seems analogous to the models used for pricing in the financial industry: some are used to price specific payoffs, while certain market models are used by all the payoff models.
  2. Spend Time Designing the Model (Alternative Approaches): The authors spend a lot of time in the design and model 'set-up' stage, considering alternative approaches; among the factors they weigh are complexity, data availability and selection bias. For practitioners in the financial industry this is a core model development principle under the regulatory guidelines for model risk (SR 11-7), where model developers are expected to document the alternative approaches that led to the final model selection.
  3. Model Performance Measurement:
    • Measure the performance of an ML model relative to a baseline: the median performance of the other ML models of the same type/class.
    • Measure the offline (pre-implementation) performance improvement of successive versions of an ML model against its first version, to check the lift from continued model re-development (a minimal sketch of both comparisons appears after this list).
    • Use Randomized Controlled Trials (RCTs) to check the ML model's performance after implementation. At Booking.com this process is integral to model development and is enabled by an infrastructure that allows hypotheses to be tested before deployment; it is elaborated further in the last point.
    • Why Model Performance does not always lead to Business Value: Improving a model's offline performance (e.g. AUC) does not always lead to a gain in business value, and in some cases there is a negative correlation between business value and the observed improvement in model performance. This is non-intuitive, but the data scientists at Booking.com explain it through performance over-saturation; segment saturation (less user data left to test on); over-optimization, which can amount to overfitting; and, the one I found most interesting, increased accuracy causing negative user feedback, i.e. users freaking out because the model was so accurate that it predicted the destination they were thinking of in the example cited.
  4. (Ongoing) Monitoring: The data scientists use Response Distribution Charts (RDCs) to monitor the performance of binary classification models in settings where the 'true value' of a prediction is not known immediately. The intuitive claim is that for an ideal binary classifier the RDC, a histogram of the model's output scores, should show one peak near 0 and one near 1, with heights proportional to the class mix in the data (see the sketch after this list). This could be applicable to many classification models where the 'ground truth' is not immediately available.
  5. Time is Money, So Handle Latency: ML models require significant computation to make predictions, which increases latency and causes negative user feedback, so the authors mention proven techniques such as precomputation and caching (sketched after this list), batching requests over the network, and keeping redundant copies of the models available across the network.
  6. Model Evaluation Using Randomized Controlled Trials: The data scientists at Booking.com conduct sophisticated experiments using variants of Randomized Controlled Trials (RCTs) to assess the impact of new ML model features on users, which should be relevant to many business use cases for ML models:
    • Triggered Analysis is used where only the treatable subjects (users for whom the new model feature was triggered) in the control and treatment groups are analyzed (a sketch follows this list).
    • Model-Output-Dependent Triggering is used where users in the treatment groups are exposed to a new model feature that is triggered by the model's own output. In these cases the control group sees no change at all, and two treatment groups are used: one where triggered users are exposed to the change, and a second where users are not exposed to any change regardless of the model output. The statistical analysis is then conducted using only the triggered subjects from both treatment groups.
    • RCTs are also used to measure latency and other unintended effects introduced by new model features.
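
To make the performance-measurement ideas in point 3 concrete, here is a minimal sketch of my own (not code from the paper) comparing a model's offline metric against the median of its class and against its first version; the model names and AUC values are entirely hypothetical:

```python
from statistics import median

# Hypothetical offline AUC scores for models in the same class.
class_aucs = {"dest_ranker_a": 0.71, "dest_ranker_b": 0.74, "dest_ranker_c": 0.69}
baseline = median(class_aucs.values())

# Successive versions of one model, oldest first.
versions = [0.70, 0.72, 0.73, 0.735]
new_auc = versions[-1]

print(f"Lift over class median:  {new_auc - baseline:+.3f}")
print(f"Lift over first version: {new_auc - versions[0]:+.3f}")
```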
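
The Response Distribution Chart from point 4 is just a histogram of the model's predicted scores, so it can be drawn without any ground-truth labels. Here is a small sketch with simulated scores for a well-separated classifier; note the two peaks, with heights roughly proportional to the (assumed) 70/30 class mix:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated scores from a well-separated binary classifier:
# ~70% negatives clustered near 0, ~30% positives near 1.
scores = np.concatenate([
    rng.beta(a=1, b=8, size=7000),  # negatives -> mass near 0
    rng.beta(a=8, b=1, size=3000),  # positives -> mass near 1
])

plt.hist(scores, bins=50)
plt.xlabel("Predicted probability")
plt.ylabel("Count")
plt.title("Response Distribution Chart (simulated)")
plt.show()
```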
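
For the precomputation-and-caching technique in point 5, a toy sketch is below; the model call is a placeholder stub I made up to simulate inference latency, and in practice it would invoke the served model:

```python
import time
from functools import lru_cache

def expensive_model_score(user_segment: str, destination: str) -> float:
    """Stand-in for a slow model call (hypothetical)."""
    time.sleep(0.05)  # simulate inference latency
    return (hash((user_segment, destination)) % 100) / 100

@lru_cache(maxsize=100_000)
def cached_score(user_segment: str, destination: str) -> float:
    # Identical inputs hit the cache instead of re-running the model.
    return expensive_model_score(user_segment, destination)

# First call pays the latency; repeats are served from the cache.
cached_score("leisure", "Amsterdam")
cached_score("leisure", "Amsterdam")
print(cached_score.cache_info())  # hits=1, misses=1
```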
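
Finally, a sketch of the triggered-analysis idea from point 6, on made-up data: only subjects for whom the new model behaviour was actually triggered, in both arms, enter the statistical comparison, which avoids diluting the effect with never-treatable users. The trigger rate, effect size and t-test are my own illustrative choices, not the paper's methodology:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical experiment log: group assignment, whether the new
# model feature fired for the user, and the observed outcome.
group = rng.choice(["control", "treatment"], size=n)
triggered = rng.random(n) < 0.2                      # feature fires for ~20% of users
outcome = rng.normal(0.0, 1.0, size=n)
outcome[(group == "treatment") & triggered] += 0.1   # simulated treatment effect

# Triggered analysis: restrict both arms to triggered subjects only.
ctrl = outcome[(group == "control") & triggered]
treat = outcome[(group == "treatment") & triggered]
t, p = stats.ttest_ind(treat, ctrl, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```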
Overall, I found the paper to be an interesting read, as it covers a less-researched aspect of delivering business value with ML models by combining different disciplines: software engineering, project management, hypothesis testing, infrastructure and data analysis.
