$r_i = \bar{p}_i$                                                                 (7)

where $\bar{p}_i$ denotes the average positive feedback rate of the $i$-th item in the records. It is commonly believed that the average result obtained from a large number of records is close to the expected value and can therefore be regarded as the simulated reward.
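As a minimal sketch (not the paper's code), the average positive feedback rate per item can be computed from the logged records as follows; the (user, item, rating) record format, the function name, and the rating threshold of three (taken from the experimental setup below) are assumptions for illustration.

```python
from collections import defaultdict

def simulated_rewards(records, positive_threshold=3):
    """Estimate the simulated reward of each item as its average
    positive feedback rate over the historical records.

    records: iterable of (user_id, item_id, rating) tuples (assumed format).
    Returns: dict mapping item_id -> average positive feedback rate.
    """
    positives = defaultdict(int)   # number of positive feedbacks per item
    counts = defaultdict(int)      # total number of feedbacks per item
    for _user, item, rating in records:
        counts[item] += 1
        if rating > positive_threshold:
            positives[item] += 1
    # With many records, the empirical rate approximates the expected
    # positive feedback probability and serves as the simulated reward.
    return {item: positives[item] / counts[item] for item in counts}
```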
D. Offline Reinforcement Learning

With the interaction history between the user and the recommender, we can train the DRRS and complete the recommendation task. Experience replay [20] is used in training the DRRS, which means the whole process consists of two stages: a transition-storing stage and a model-training stage. Moreover, separate evaluation and target networks [21] are introduced to help smooth the learning and avoid divergence of the parameters.
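A minimal sketch of the experience replay component, assuming transitions are stored as (state, action, reward, next state) tuples; the class name and buffer capacity are illustrative choices, not the paper's implementation. The separated target networks can simply be initialized as copies of the evaluation networks.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions (s, a, r, s_next)."""

    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a minibatch for off-policy training.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```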
In the transition-storing stage, given the current state $s_t$, the recommender recommends the item with the highest score, where the score is calculated from the output of the Actor network $\pi_\theta(s_t)$. Then the reward $r_t$ is calculated by the environment simulator, and the next state $s_{t+1}$ is generated according to the user feedback. Finally, the recommender agent stores the transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ into the replay memory. In the model-training stage, the recommender samples a minibatch of transitions and feeds it into the model to train it in an off-policy manner. The whole training procedure of the proposed method is shown in Algorithm 1.
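The two stages can be summarized as one interaction step, sketched below; actor, critic, simulator, buffer, and their methods are placeholder components standing in for the paper's modules rather than a definitive implementation.

```python
def interaction_step(state, actor, critic, simulator, buffer, batch_size=64):
    """One transition-storing step followed by one model-training step."""
    # --- transition-storing stage ---
    action = actor.act(state)                            # Actor output for the current state
    item = simulator.rank_and_recommend(state, action)   # item with the highest score
    reward = simulator.reward(item)                      # simulated reward, cf. eq. (7)
    next_state = simulator.next_state(state, item)       # built from the user feedback
    buffer.push(state, action, reward, next_state)

    # --- model-training stage (off-policy) ---
    if len(buffer) >= batch_size:
        batch = buffer.sample(batch_size)
        critic.update(batch)   # minimize the TD loss, cf. step 15 of Algorithm 1
        actor.update(batch)    # sampled policy gradient, cf. step 14 of Algorithm 1
    return next_state
```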
IV. EXPERIMENT

In this section, the details of the datasets and the evaluation metric are given first. Then we introduce three traditional ranking strategies and compare the performance of the proposed method DRRS with these baseline methods. The convergence curve is shown to demonstrate the correctness of the proposed model. Finally, the impact of different hyper-parameters on the performance of the proposed model is explored.

A. Dataset and Evaluation Metric

We perform our experiments on the following publicly available datasets:

• MovieLens (100K). Contains 0.1 million ratings from 943 users on 1682 movies, collected from the MovieLens website.

• MovieLens (1M). A benchmark recommender-system dataset including 1 million ratings from 6040 users on 3952 movies.

All ratings on both datasets take values in {1, 2, 3, 4, 5}. We regard feedback with a rating larger than three as positive feedback and the rest as negative feedback. The bid price of each item is simulated by sampling from a uniform distribution between 0.1 and 1.0. The evaluation metric is defined as the total cumulative reward over T steps on the test set.
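A short sketch of this preprocessing, under the assumption that the raw data are (user, item, rating) tuples: feedback is binarized at the rating threshold of three and each item receives a bid price drawn from U(0.1, 1.0).

```python
import random

def preprocess(ratings, seed=0):
    """Turn raw (user, item, rating) tuples into binary feedback and
    attach a simulated bid price to every item."""
    rng = random.Random(seed)
    feedback = [(u, i, 1 if r > 3 else 0) for (u, i, r) in ratings]
    items = {i for (_u, i, _r) in ratings}
    bid_price = {i: rng.uniform(0.1, 1.0) for i in items}
    return feedback, bid_price
```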
Parameter setting: The discount factor γ is set to 0.9. The exploration rate is set to 0.1, and μ and ε are set to 0 and 0.1, respectively. The network structures of the Actor and the Critic are 640-500-300-64 and 640-564-300-1, respectively. The activation function is ReLU, and batch normalization is applied after each layer except the output layer. Adam with default parameters and a batch size of 64 is used in the training stage. The dimension of the item embedding is set to 64. The training and evaluation steps of DRRS are both set to 10 by grid search over {5, 10, 15, 20, 40}. The dataset is randomly split into an 80% training set and a 20% test set. We report the average recommendation performance over 30 runs of DRRS to reduce randomness.
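For concreteness, the stated network structures can be built as in the sketch below; PyTorch is an assumed framework, and reading the Critic's 564-unit layer as a 500-unit state layer concatenated with the 64-dimensional action is an assumption rather than a detail given in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """640-d state -> 64-d action, structure 640-500-300-64 (assumed layout)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(640, 500), nn.BatchNorm1d(500), nn.ReLU(),
            nn.Linear(500, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, 64),          # output layer: no batch normalization
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """State-action value, structure 640-564-300-1 (564 = 500 + 64, assumed)."""
    def __init__(self):
        super().__init__()
        self.state_layer = nn.Sequential(
            nn.Linear(640, 500), nn.BatchNorm1d(500), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(564, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, 1))           # output layer: no batch normalization

    def forward(self, state, action):
        h = self.state_layer(state)
        return self.head(torch.cat([h, action], dim=1))

# Adam with default parameters, as in the reported setting.
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters())
critic_opt = torch.optim.Adam(critic.parameters())
```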
Algorithm 1: Training procedure of DRRS
Input: batch size $N$, discount factor $\gamma$, reward function $R$, learning rate of the Actor $\eta_a$, learning rate of the Critic $\eta_c$, number of sessions $S$, number of learning steps $T$, target update rate $\tau$
1.  Initialize the Actor network $\pi_\theta$ and the Critic network $Q_w$ with random weights $\theta$ and $w$
2.  Initialize the target networks $\pi_{\theta'}$ and $Q_{w'}$ with weights $\theta' \leftarrow \theta$, $w' \leftarrow w$
3.  Initialize the replay buffer D
4.  for session = 1, S do
5.      Receive the initial state $s_1$ of the user
6.      for t = 1, T do
7.          Select an action $a_t$ according to the policy of the Actor $\pi_\theta(s_t)$
8.          Recommend the item with the highest ranking score calculated by (2)
9.          Observe the user feedback after the recommendation
10.         Generate the new state $s_{t+1}$ according to the feedback
11.         Store the transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ in D
12.         Sample $N$ transitions $\langle s_i, a_i, r_i, s_{i+1} \rangle$ from D
13.         Set $y_i = r_i + \gamma Q_{w'}(s_{i+1}, \pi_{\theta'}(s_{i+1}))$
14.         Update the Actor by using the sampled policy gradient
15.         Update the Critic by minimizing the loss in (6)
16.         Update the target networks: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $w' \leftarrow \tau w + (1-\tau)w'$
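Steps 13 and 16 of Algorithm 1 correspond to the standard target computation and soft (Polyak) target update; a minimal sketch, assuming PyTorch networks as in the earlier sketch, is given below.

```python
import torch

def critic_target(reward, next_state, target_actor, target_critic, gamma=0.9):
    """y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1}))  (step 13).
    reward is assumed to be a tensor of shape [N, 1]."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        return reward + gamma * target_critic(next_state, next_action)

def soft_update(target_net, net, tau):
    """theta' <- tau * theta + (1 - tau) * theta'  (step 16)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```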
B. Performance Comparison

TABLE I. PERFORMANCE COMPARISON OF ALL METHODS
(cumulative reward of 10 steps)

Model         ML(100K)    ML(1M)
Baseline_1    1698.1      1579.5
Baseline_2    1785.2      1694.2
Baseline_3    1801.9      1685.3
DRRS          2097.4      1893.2
We compare the performance of the proposed method DRRS with the baseline algorithms. All compared methods can be summarized into the same ranking-score pattern and differ only in how the weighting parameter is obtained. The three traditional ranking baselines are: (1) baseline_1, where the weighting parameter is fixed to 1; (2) baseline_2, where the weighting parameter is set to the real number that maximizes the cumulative revenue on the training set by grid search; (3) baseline_3, which is the same as the second baseline except that the weighting parameter can take a different value for each user. The proposed DRRS applies a DRL method to obtain the weighting parameter.
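As an illustration of baseline_2, the grid search can be sketched as below; the candidate grid and the revenue-evaluation callable are hypothetical, since the paper does not specify them.

```python
def grid_search_baseline(evaluate_cumulative_revenue,
                         candidates=(0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    """baseline_2: pick the fixed weighting parameter with the highest
    cumulative revenue on the training set.

    evaluate_cumulative_revenue: callable mapping a candidate parameter to
    the cumulative revenue it achieves on the training set (hypothetical).
    """
    return max(candidates, key=evaluate_cumulative_revenue)
```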
Table I summarizes the performance of these models on the ML(100K) and ML(1M) datasets. It can be observed that DRRS outperforms all three baselines on both datasets.