Exploration versus exploitation in assortment optimization with limited inventory and substitutable demand

Document Type: IIIEC 2021


Graduate School of Management and Economics, Sharif University of Technology, Tehran, Iran


This study considers an online multi-period assortment optimization problem over multiple replenishment cycles, in which the seller chooses a subset of N substitutable products and decides the limited quantity of each to order and sell in every period. The seller is constrained by a total inventory capacity, a cardinality constraint on product variety (shelf space), and predetermined replenishment intervals. Assortment selection is modeled as a multi-armed bandit problem, and customer choice is modeled by the multinomial logit (MNL) choice model. The objective is to maximize revenue by learning the demand parameters and improving the offering composition in every period. In this novel approach, the offering, and consequently the exploration-exploitation decision, has two dimensions: the assortment and the inventory allocation. The study develops a model and a policy for learning and optimization that perform well in numerical simulations. The results suggest that the capacity constraint has a significant impact on the total profit of a seller who tries to learn the demand and the best inventory composition on the fly.
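To make the MNL choice model and the cardinality constraint concrete, the following is a minimal Python sketch, not the policy developed in the study. It assumes the standard MNL form, in which product i in an offered assortment S is chosen with probability v_i / (1 + sum of v_j over S), with the 1 standing for the no-purchase option. The attraction weights `v`, the per-unit revenues `r`, and all function names are hypothetical, the parameters are treated as known, and inventory limits and learning are ignored.

```python
from itertools import combinations


def mnl_choice_probs(assortment, v):
    """Purchase probability of each offered product under the MNL model.

    The no-purchase option has attraction weight 1, so the probabilities
    returned here sum to less than 1.
    """
    denom = 1.0 + sum(v[i] for i in assortment)
    return {i: v[i] / denom for i in assortment}


def expected_revenue(assortment, v, r):
    """Expected revenue per arriving customer for a given assortment."""
    probs = mnl_choice_probs(assortment, v)
    return sum(r[i] * p for i, p in probs.items())


def best_assortment(v, r, max_size):
    """Brute-force search over all assortments with at most max_size products.

    Only practical for small N; shown purely to illustrate the
    cardinality (shelf-space) constraint.
    """
    best, best_rev = (), 0.0
    for k in range(1, max_size + 1):
        for s in combinations(range(len(v)), k):
            rev = expected_revenue(s, v, r)
            if rev > best_rev:
                best, best_rev = s, rev
    return best, best_rev


if __name__ == "__main__":
    # Hypothetical parameters for four substitutable products.
    v = [1.0, 0.8, 0.5, 0.3]   # attraction weights (assumed known here)
    r = [1.0, 1.2, 1.5, 2.0]   # per-unit revenues
    assortment, revenue = best_assortment(v, r, max_size=2)
    print("assortment:", assortment, "expected revenue:", round(revenue, 4))
```

In the bandit setting described above, the weights in `v` are unknown and must be estimated from observed purchases, so a learning policy would replace them with estimates or sampled values before solving the same selection step.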


