• A Deep Reinforcement Learning based Ranking Strategy (DRRS) is proposed to maximize the cumulative reward of the platform (see the sketch after this list). The ranking function is redefined as f(b_i, p_i; ω) and DDPG is utilized to learn ω.
• A recommender system simulator is built to obtain immediate feedback without hurting the performance of the commercial platform.
• Experiments conducted on real-world datasets further demonstrate that the proposed method performs much better than traditional ranking algorithms.
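The cumulative reward maximized here is the usual discounted return from RL, using the discount factor γ defined in Section III. The excerpt states this goal only in words, so the short helper below is an explicit restatement for clarity, not the paper's code.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative reward: sum over t of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two clicks (reward 1) at steps 1 and 3 of a four-step session:
discounted_return([0.0, 1.0, 0.0, 1.0])  # 0.9 + 0.729 = 1.629
```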
The rest of this paper is organized as follows. Related work is illustrated in Section II. The proposed method DRRS is introduced in Section III. After that, the experiment results are shown in Section IV. Section V concludes this paper and proposes future work.
II. RELATED WORK
The auction mechanism workflow in a recommender system can be described as follows: each advertiser provides a bid price to the platform for an advertisement response (e.g., a click/download in the App Store). The platform then orders the advertisements by a pre-defined ranking function and presents the advertisements with the top ranking scores to the users. Finally, the platform charges the advertisers the bid price for the positive responses of the users. In our work, the generalized first price (GFP) mechanism is applied for charging, and the goal is to maximize the long-term revenue of the platform. Traditional ranking strategies only consider the immediate reward and apply greedy ranking strategies.
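To make this workflow concrete, here is a minimal sketch of the greedy scheme just described. The names (Ad, serve_and_charge, k) are illustrative, not from the paper; the ranking score bid × pCTR and the pay-your-bid GFP charge follow the description above.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    bid: float    # advertiser's bid price for a positive response
    pctr: float   # predicted click-through rate

def serve_and_charge(ads, clicked, k=3):
    """Rank ads by the greedy score bid * pctr, present the top k,
    and charge each clicked advertiser its own bid (GFP)."""
    ranked = sorted(range(len(ads)),
                    key=lambda i: ads[i].bid * ads[i].pctr, reverse=True)
    shown = ranked[:k]
    revenue = sum(ads[i].bid for i in shown if i in clicked)
    return shown, revenue

# Toy usage: three ads, the user clicks the first ad shown.
ads = [Ad(1.0, 0.10), Ad(0.5, 0.30), Ad(2.0, 0.02)]
shown, revenue = serve_and_charge(ads, clicked={1}, k=2)  # shown == [1, 0]
```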
Reinforcement learning (RL) is usually modeled as a Markov Decision Process (MDP) [8] and is concerned with how agents ought to take actions in an environment so as to maximize some pre-defined cumulative reward. Since simple RL methods (Q-learning, SARSA) cannot handle tasks with continuous state or action spaces, deep reinforcement learning (DRL) was proposed. DRL extends reinforcement learning to the entire pipeline from high-dimensional observations to continuous actions by modeling it with a deep network, without explicitly designing the state or action space. DRL has been applied successfully to various problems, including the game of Go [10], robot control [11], video games [12], etc. As an important DRL model, the DDPG algorithm used in this paper is an Actor-Critic algorithm: the Actor generates an action from the given state, and the Critic approximates the optimal state-action value function.
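For concreteness, the following is a minimal PyTorch sketch of this Actor-Critic structure. The layer sizes and names (STATE_DIM, ACTION_DIM, HIDDEN) are illustrative assumptions; the paper's actual architecture is not specified in this excerpt.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 64, 16, 128  # illustrative sizes only

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),  # bounded action vector
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Approximates the state-action value function Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

In standard DDPG training, the Critic is fit by temporal-difference learning while the Actor is updated by ascending the Critic's gradient with respect to the action; target networks and a replay buffer stabilize the process.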
However, most existing DRL methods for recommendation [13, 14] assume that the platform is non-profitable, meaning the reward of these models does not contain the revenue of the platform; they only consider the ranking quality. Reference [15] does consider the platform revenue, but its central idea is to find a ranking function which simultaneously satisfies the platform, the users and the advertisers, rather than to maximize the platform revenue.
is defined as the user’s click
x State : The state
history sorted in chronological order before time t.
x Action α : The action α is defined as a continuous
vector. The ω involved in the ranking function is
defined as the inner product of α and the item
embedding .
x Reward : After the platform receives the state and
selects an action to recommend items to users, the
reward is defined as the feedback provided by the
users, e.g., click/not click.
x Transition : As the state is defined as the click
history of the user, when the feedback is collected, the
next state is determined.
x Discount factor γ: γ ∈ [0,1] controls the importance
of future rewards. For example, if γ = 0, the model
only considers the immediate reward. When γ = 1, the
model considers the future reward and immediate
reward are equally important.
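As a concrete illustration, the sketch below computes ω_i as the inner product of the action α and each item embedding e_i, as defined above, and scores candidate items. The excerpt gives the ranking function only abstractly as f(b_i, p_i; ω), so the specific form used here (bid times pCTR raised to the learned power) is an assumption borrowed from common revenue-aware ranking practice, not the paper's stated formula.

```python
import numpy as np

def rank_items(alpha, item_embeddings, bids, pctrs):
    """Score candidates given the action alpha; highest score first.

    omega_i = <alpha, e_i> per the Action definition above; the score
    b_i * p_i ** omega_i is an ASSUMED instantiation of f(b_i, p_i; omega).
    """
    omegas = item_embeddings @ alpha           # one omega per item
    scores = bids * np.power(pctrs, omegas)    # assumed form of f
    return np.argsort(-scores)                 # item indices, best first

# Toy usage with random candidates.
rng = np.random.default_rng(0)
order = rank_items(rng.normal(size=8),
                   rng.normal(size=(100, 8)),    # 100 item embeddings
                   rng.uniform(0.1, 2.0, 100),   # bids
                   rng.uniform(0.01, 0.3, 100))  # predicted CTRs
```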
The user-agent interaction of the recommender system is shown in Fig. 2. The recommendation task can thus be stated as: given the recommender-user interaction history in MDP form, find a policy producing α that maximizes the cumulative reward of the recommender system.
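A minimal sketch of this interaction loop follows, assuming the Actor from the earlier sketch and a hypothetical RecEnv simulator whose reset() returns an initial state and whose step() returns (next_state, reward, done); none of these names come from the paper.

```python
def run_episode(env, actor, gamma=0.9, max_steps=50):
    """One simulated user session: the agent picks continuous actions,
    the environment returns click feedback as the reward."""
    state = env.reset()                 # initial click history (the state)
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = actor(state)           # continuous action vector alpha
        state, reward, done = env.step(action)  # clicks extend the history
        ret += discount * reward        # discounted cumulative reward
        discount *= gamma
        if done:
            break
    return ret
```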