
Abstract: For every business, and particularly for a growing company like Wix, it is crucial to have a firm grasp of future incoming cash flow. This allows us to plan and allocate money for operating costs and investments. Because Wix is a public company, the forecasts also feed into the guidance given to investors for the upcoming fiscal quarter or year.
Wix makes most of its revenue from paid subscriptions bought by new and existing users. The number of premium subscriptions and the amount of cash collected are the targets we want to forecast. The usual approach treats this as a time-series problem in which we target specific dates. Instead, we can treat it as a regression problem in which each target value is defined by the users' registration date and the cohort's age. This is what we call a cohort-based model.
Users who registered on a given date form a cohort. Each cohort has its own features (e.g. size, country), which are joined to the table by registration date. In calendar time, all cohorts, regardless of age, are subject to the same events: seasonality, holidays, general trend, discounts and price changes. These features are joined to the table by upgrade date, which together with the registration date defines the cohort age. Thus a row in the table reads something like: “cohort of users registered on 2021-01-01 that is 3 days old on upgrade date 2021-01-03 produced 100 subscriptions”.
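The row layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature names (`cohort_size`, `is_holiday`, `discount`), the numbers, and the helper `build_cohort_table` are all hypothetical, and age is counted so that the registration day is day 1, which matches the example row (registered 2021-01-01, 3 days old on 2021-01-03).

```python
from datetime import date

# Hypothetical cohort-level features, keyed by registration date.
cohort_features = {
    date(2021, 1, 1): {"cohort_size": 5000},
    date(2021, 1, 2): {"cohort_size": 4200},
}

# Hypothetical calendar features, keyed by upgrade date.
calendar_features = {
    date(2021, 1, 1): {"is_holiday": 1, "discount": 0},
    date(2021, 1, 2): {"is_holiday": 0, "discount": 0},
    date(2021, 1, 3): {"is_holiday": 0, "discount": 1},
}

# Made-up targets: (registration date, upgrade date) -> subscriptions.
subscriptions = {
    (date(2021, 1, 1), date(2021, 1, 3)): 100,
}


def build_cohort_table(cohort_features, calendar_features, subscriptions):
    """Return one row per (registration date, upgrade date) pair."""
    rows = []
    for reg_date, cohort in sorted(cohort_features.items()):
        for upgrade_date, calendar in sorted(calendar_features.items()):
            if upgrade_date < reg_date:
                continue  # a cohort cannot upgrade before it registers
            row = {
                "registration_date": reg_date,
                "upgrade_date": upgrade_date,
                # Assumption: the registration day counts as day 1.
                "age_days": (upgrade_date - reg_date).days + 1,
                # Pairs with no recorded upgrades default to zero.
                "subscriptions": subscriptions.get((reg_date, upgrade_date), 0),
            }
            row.update(cohort)    # joined by registration date
            row.update(calendar)  # joined by upgrade date
            rows.append(row)
    return rows


table = build_cohort_table(cohort_features, calendar_features, subscriptions)
for row in table:
    print(row)
```

Each row now carries both the cohort's own features and the calendar features of the upgrade date, so any tabular regression model can be trained on it with `subscriptions` as the target.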
This approach allows the use of regression models which normally can’t be applied to time-series data, for example GLM, GAM and GBM with Poisson or Tweedie distributions. In terms of error rates, these cohort-based models proved to be on par with or better than time-series models like Prophet.
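To make the idea of a Poisson regression on cohort data concrete, here is a minimal sketch, not the model used at Wix. It assumes a single cohort, made-up daily subscription counts, and cohort age as the only feature, and fits log(mu) = b0 + b1 * age by Newton's method. The function name `fit_poisson_glm` and all numbers are hypothetical; in practice one would use an established GLM or gradient boosting library and include the cohort and calendar features as well.

```python
import math


def fit_poisson_glm(ages, counts, n_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * age by maximising the Poisson log-likelihood."""
    # Start from a constant model whose mean equals the observed mean.
    b0 = math.log(sum(counts) / len(counts))
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * a) for a in ages]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum((y - m) * a for y, m, a in zip(counts, mu, ages))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(m * a for m, a in zip(mu, ages))
        h11 = sum(m * a * a for m, a in zip(mu, ages))
        # Solve the 2x2 system for the Newton step.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Made-up data: subscriptions produced by one cohort at ages 1..10 days.
ages = list(range(1, 11))
counts = [100, 74, 55, 41, 30, 22, 17, 12, 9, 7]

b0, b1 = fit_poisson_glm(ages, counts)
fitted = [math.exp(b0 + b1 * a) for a in ages]

# Extrapolate the fitted decay curve one day beyond the observed ages.
forecast_age_11 = math.exp(b0 + b1 * 11)
print(f"b0={b0:.3f}, b1={b1:.3f}, forecast at age 11: {forecast_age_11:.1f}")
```

Because the target is a count and the link is logarithmic, the predictions are always positive, which is one reason count distributions such as Poisson suit this kind of target.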
Bio: Nicolai Vicol is a Data Scientist at Wix, where he specializes in forecasting new users, paid subscriptions, cash flows and generally everything related to time series. He started his career as a quant in an investment bank, then switched to data science and IT, accumulating a total of 9 years of experience in the field. Areas of interest: time series and forecasting, but also recommendation systems, search systems and operations research.