
Fig. 2. User-agent interaction in MDP

B. Proposed Framework

As mentioned in Section I, traditional ranking strategies only consider the immediate reward. We propose a Deep Reinforcement learning based Ranking Strategy (DRRS) to learn the parameter of the ranking function that maximizes the cumulative reward of the platform. After obtaining $\alpha$ from the Actor network, the ranking score of each item can be calculated, and the platform recommends the items with the top ranking scores to the current user.

The network structure of DRRS is illustrated in Fig. 1. The proposed framework consists of three key ingredients: the Actor network, the Ranking part and the Critic network. Next, we present the details of these three ingredients.

Fig. 1. Network structure of DRRS

1) Actor Network: The network structure of the Actor is shown in the left part of Fig. 1. The goal of the Actor is to generate an action based on the given state. For DRRS, the parameter $\alpha_t$ is generated and later applied in the Ranking part, as shown in Equation (1):

$$\alpha_t = f_{\theta^{\pi}}(s_t) \qquad (1)$$

where $s_t$ denotes the state representation of the incoming user at time $t$, $\theta^{\pi}$ represents the parameters of the Actor network, and $f(\cdot)$ stands for the network function of the Actor. The output $\alpha_t \in \mathbb{R}^{1 \times k}$ is a $k$-dimensional real-valued parameter vector. In addition, the widely used $\varepsilon$-greedy [16] exploration is applied by adding Gaussian noise to $\alpha_t$.

2) Ranking part: We first define a new ranking function $score_{t,i}$ to determine which items are recommended to the users. In this paper, the new ranking function is defined as Equation (2):

$$score_{t,i} = f_{rank}(e_i, \alpha_t) \qquad (2)$$

The parameter $\alpha_t$ learned by the Actor network is applied to define $f_{rank}$, as shown in Equation (3):

$$f_{rank}(e_i, \alpha_t) = \alpha_t \cdot e_i \qquad (3)$$

where $e_i$ denotes the embedding of the $i$-th item pre-trained by Probabilistic Matrix Factorization (PMF) [17], and $k$ is equal to the dimension of the item embedding. In a recommendation session, the ranking score of each item is calculated according to Equation (2), and the items with the top ranking scores are recommended to the user.

3) Critic Network: The network structure of the Critic, shown in the right part of Fig. 1, can be regarded as a Deep Q-network. The state-action value $Q(s, a)$ is learned to estimate the cumulative reward for the given state and action. The parameters of the Actor are updated in the direction that improves the Q value, namely toward selecting the best action for the given state. According to the deterministic policy gradient theorem [18], the Actor network is updated by the sampled policy gradient, as shown in Equation (4):

$$\nabla_{\theta^{\pi}} J \approx \frac{1}{N} \sum_{t} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_t,\, a = f_{\theta^{\pi}}(s_t)} \, \nabla_{\theta^{\pi}} f_{\theta^{\pi}}(s) \big|_{s = s_t} \qquad (4)$$

We follow the standard assumption that a delayed reward is discounted by a factor $\gamma$ at each time step, so the target action-value function follows the Bellman equation [19], as shown in Equation (5):

$$y_t = \mathbb{E}\left[ r_t + \gamma\, Q^{*}(s_{t+1}, a_{t+1} \mid \theta^{\pi}, \theta^{Q}) \right] \qquad (5)$$

where $\theta^{Q}$ represents the parameters of the Critic network and $y_t$ denotes the target state-action value for this iteration.

A deep neural network is applied to learn the non-linear relationships of the action-value function. The goal of the Critic network is to minimize the mean square error between its output and the target value calculated by the Bellman function, as shown in Equation (6):

$$L(\theta^{Q}) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\left[ \left( y_t - Q(s_t, a_t; \theta^{Q}) \right)^{2} \right] \qquad (6)$$
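To make the interplay of Equations (1)-(6) concrete, the following is a minimal PyTorch-style sketch of one DRRS training step: the Actor produces $\alpha_t$ (Eq. (1)), the Ranking part scores items by $\alpha_t \cdot e_i$ (Eqs. (2)-(3)), the Critic is trained with the Bellman target and mean square error (Eqs. (5)-(6)), and the Actor follows the sampled policy gradient (Eq. (4)). The choice of PyTorch, the layer sizes, the use of target networks and Adam optimizers, and names such as Actor, Critic, rank_items, item_embeddings and noise_std are illustrative assumptions rather than details given in the paper.

```python
# Minimal sketch of one DRRS training step (Eqs. (1)-(6)).
# Layer sizes, optimizers and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, K = 64, 32   # k = item-embedding dimension (assumed values)
GAMMA = 0.99            # discount factor

class Actor(nn.Module):
    """Eq. (1): alpha_t = f_{theta_pi}(s_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, K), nn.Tanh())
    def forward(self, state):
        return self.net(state)            # alpha_t in R^{1 x k}

class Critic(nn.Module):
    """Q(s, a | theta_Q): a Deep Q-network over the (state, alpha) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + K, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, state, alpha):
        return self.net(torch.cat([state, alpha], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = Actor(), Critic()   # target networks for Eq. (5)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def rank_items(alpha, item_embeddings, top_n=10):
    """Ranking part, Eqs. (2)-(3): score_{t,i} = alpha_t . e_i, recommend top-N.
    item_embeddings: (num_items, K) tensor of PMF-pretrained embeddings."""
    scores = item_embeddings @ alpha.squeeze(0)       # one score per item
    return torch.topk(scores, top_n).indices

def act_with_exploration(state, noise_std=0.1):
    """Eq. (1) plus Gaussian exploration noise added to alpha_t."""
    with torch.no_grad():
        alpha = actor(state)
    return alpha + noise_std * torch.randn_like(alpha)

def train_step(state, alpha, reward, next_state):
    """state: (B, STATE_DIM), alpha: (B, K), reward: (B, 1), next_state: (B, STATE_DIM)."""
    # Critic update: Bellman target (Eq. (5)) and mean square error (Eq. (6)).
    with torch.no_grad():
        next_alpha = actor_target(next_state)
        y = reward + GAMMA * critic_target(next_state, next_alpha)
    critic_loss = F.mse_loss(critic(state, alpha), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: sampled deterministic policy gradient (Eq. (4)),
    # implemented by ascending Q(s, f_{theta_pi}(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

In an interaction session, act_with_exploration would supply the (noisy) $\alpha_t$ that rank_items turns into a top-N recommendation, and the observed transition $(s_t, \alpha_t, r_t, s_{t+1})$ would later be fed to train_step.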
C. Environment Simulator

In a commercial system, the exploration part of DRL may cause unstable model performance, which can result in losses of millions of dollars. Therefore, training the model on offline log files is a more practical solution. To train the proposed model offline, we need to build a simulator of the online environment to generate users' feedback. When the platform recommends an item to a user and the corresponding record is in the interaction log file, the reward is set to 1 if the feedback is positive and to 0 otherwise. If the record is not in the interaction log file, we use popularity-based statistics of each item to predict the user feedback. To summarize, the environment simulator can be defined as Equation (7).
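To illustrate the simulator logic described above (reward 1 for a positive logged interaction, 0 for a negative one, and a popularity-based estimate when the user-item pair does not appear in the log), here is a small Python sketch. The class name OfflineSimulator, the dictionary-based log, and the use of an item's empirical positive-feedback rate as the fallback are assumptions made for illustration; the paper formalizes this behavior in Equation (7).

```python
# Sketch of the offline environment simulator described in Section C.
# Class/attribute names and the exact popularity-based fallback are assumptions.
class OfflineSimulator:
    def __init__(self, interaction_log, item_popularity):
        # interaction_log: dict mapping (user_id, item_id) -> bool (positive feedback?)
        # item_popularity: dict mapping item_id -> empirical positive-feedback rate
        #                  estimated from the offline log
        self.log = interaction_log
        self.popularity = item_popularity

    def feedback(self, user_id, item_id):
        """Return the simulated reward for recommending item_id to user_id."""
        key = (user_id, item_id)
        if key in self.log:
            # Record exists in the log: reward 1 for positive feedback, else 0.
            return 1.0 if self.log[key] else 0.0
        # Record not in the log: fall back to popularity-based statistics,
        # e.g. use the item's positive-feedback rate as the expected reward.
        return self.popularity.get(item_id, 0.0)
```

During offline training, this simulator stands in for live users: the reward $r_t$ for a recommended item is taken from feedback() instead of real online feedback.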