$r_i = \bar{p}_i$                                                                 (7)

where $\bar{p}_i$ denotes the average positive feedback rate of the $i$-th item in the records. It is commonly believed that the average result obtained from a large number of records is close to the expected value and can therefore be regarded as the simulated reward.
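As a minimal sketch (not the paper's code), the average positive feedback rate per item can be computed from the logged records as follows; the (user, item, rating) record format, the function name, and the rating threshold of three (taken from the experimental setup below) are assumptions for illustration.

```python
from collections import defaultdict

def simulated_rewards(records, positive_threshold=3):
    """Estimate the simulated reward of each item as its average
    positive feedback rate over the historical records.

    records: iterable of (user_id, item_id, rating) tuples (assumed format).
    Returns: dict mapping item_id -> average positive feedback rate.
    """
    positives = defaultdict(int)   # number of positive feedbacks per item
    counts = defaultdict(int)      # total number of feedbacks per item
    for _user, item, rating in records:
        counts[item] += 1
        if rating > positive_threshold:
            positives[item] += 1
    # With many records, the empirical rate approximates the expected
    # positive feedback probability and serves as the simulated reward.
    return {item: positives[item] / counts[item] for item in counts}
```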
D. Offline Reinforcement Learning

With the interaction history between the user and the recommender, we can train the DRRS and complete the recommendation task. Experience replay [20] is used in training the DRRS, which means the whole process consists of two stages: a transition-storing stage and a model-training stage. Moreover, separate evaluation and target networks [21] are introduced to help smooth the learning and avoid divergence of the parameters.
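A minimal sketch of the experience replay component, assuming transitions are stored as (state, action, reward, next state) tuples; the class name and buffer capacity are illustrative choices, not the paper's implementation. The separated target networks can simply be initialized as copies of the evaluation networks.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions (s, a, r, s_next)."""

    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a minibatch for off-policy training.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```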
In the transition-storing stage, given the current state $s_t$, the recommender recommends the item with the highest score, where the score is calculated from the output of the Actor network $\pi_\theta(s_t)$. Then the reward $r_t$ is calculated by the environment simulator, and the next state $s_{t+1}$ is generated according to the user feedback. Finally, the recommender agent stores the transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ into the replay memory. In the model-training stage, the recommender samples a minibatch of transitions and feeds it into the model to train it in an off-policy manner. The whole training procedure of the proposed method is shown in Algorithm 1.
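The two stages can be summarized as one interaction step, sketched below; actor, critic, simulator, buffer, and their methods are placeholder components standing in for the paper's modules rather than a definitive implementation.

```python
def interaction_step(state, actor, critic, simulator, buffer, batch_size=64):
    """One transition-storing step followed by one model-training step."""
    # --- transition-storing stage ---
    action = actor.act(state)                            # Actor output for the current state
    item = simulator.rank_and_recommend(state, action)   # item with the highest score
    reward = simulator.reward(item)                      # simulated reward, cf. eq. (7)
    next_state = simulator.next_state(state, item)       # built from the user feedback
    buffer.push(state, action, reward, next_state)

    # --- model-training stage (off-policy) ---
    if len(buffer) >= batch_size:
        batch = buffer.sample(batch_size)
        critic.update(batch)   # minimize the TD loss, cf. step 15 of Algorithm 1
        actor.update(batch)    # sampled policy gradient, cf. step 14 of Algorithm 1
    return next_state
```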
IV. EXPERIMENT

In this section, the details of the datasets and the evaluation metric are given first. Then we introduce three traditional ranking strategies and compare the performance of the proposed method DRRS with these baseline methods. The convergence curve is shown to demonstrate the correctness of the proposed model. Finally, the impact of different hyper-parameters on the performance of the proposed model is explored.

A. Dataset and Evaluation Metric

We perform our experiments on the following publicly available datasets:

• MovieLens (100K). Contains 0.1 million ratings from 943 users on 1682 movies, collected from the MovieLens website.

• MovieLens (1M). A benchmark recommender-system dataset including 1 million ratings from 6040 users on 3952 movies.

All ratings on both datasets take values in {1, 2, 3, 4, 5}. We regard feedback with a rating larger than three as positive feedback and the rest as negative feedback. The bid price of each item is simulated by sampling from a uniform distribution between 0.1 and 1.0. The evaluation metric is defined as the total cumulative reward over T steps on the test set.
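A short sketch of this preprocessing, under the assumption that the raw data are (user, item, rating) tuples: feedback is binarized at the rating threshold of three and each item receives a bid price drawn from U(0.1, 1.0).

```python
import random

def preprocess(ratings, seed=0):
    """Turn raw (user, item, rating) tuples into binary feedback and
    attach a simulated bid price to every item."""
    rng = random.Random(seed)
    feedback = [(u, i, 1 if r > 3 else 0) for (u, i, r) in ratings]
    items = {i for (_u, i, _r) in ratings}
    bid_price = {i: rng.uniform(0.1, 1.0) for i in items}
    return feedback, bid_price
```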
Parameter setting: The discount factor γ is set to 0.9. The exploration rate is set to 0.1, and μ and ε are set to 0 and 0.1, respectively. The network structures of the Actor and the Critic are 640-500-300-64 and 640-564-300-1, respectively. The activation function is ReLU, and batch normalization is applied after each layer except the output layer. Adam with default parameters and a batch size of 64 is used in the training stage. The dimension of the item embedding is set to 64. The training and evaluation steps of DRRS are both set to 10 by grid search over {5, 10, 15, 20, 40}. The dataset is randomly split into an 80% training set and a 20% test set. We report the average recommendation performance over 30 runs of DRRS to reduce randomness.
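For concreteness, the stated network structures can be built as in the sketch below; PyTorch is an assumed framework, and reading the Critic's 564-unit layer as a 500-unit state layer concatenated with the 64-dimensional action is an assumption rather than a detail given in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """640-d state -> 64-d action, structure 640-500-300-64 (assumed layout)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(640, 500), nn.BatchNorm1d(500), nn.ReLU(),
            nn.Linear(500, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, 64),          # output layer: no batch normalization
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """State-action value, structure 640-564-300-1 (564 = 500 + 64, assumed)."""
    def __init__(self):
        super().__init__()
        self.state_layer = nn.Sequential(
            nn.Linear(640, 500), nn.BatchNorm1d(500), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(564, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, 1))           # output layer: no batch normalization

    def forward(self, state, action):
        h = self.state_layer(state)
        return self.head(torch.cat([h, action], dim=1))

# Adam with default parameters, as in the reported setting.
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters())
critic_opt = torch.optim.Adam(critic.parameters())
```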
Algorithm 1: Training procedure of DRRS
Input: batch size $N$, discount factor $\gamma$, reward function $R$, learning rate of the Actor $\eta_a$, learning rate of the Critic $\eta_c$, number of sessions $S$, number of learning steps $T$, target update rate $\tau$
1.  Initialize the Actor network $\pi_\theta$ and the Critic network $Q_w$ with random weights $\theta$ and $w$
2.  Initialize the target networks $\pi_{\theta'}$ and $Q_{w'}$ with weights $\theta' \leftarrow \theta$, $w' \leftarrow w$
3.  Initialize the replay buffer D
4.  for session = 1, S do
5.      Receive the initial state $s_1$ of the user
6.      for t = 1, T do
7.          Select an action $a_t$ according to the policy of the Actor $\pi_\theta(s_t)$
8.          Recommend the item with the highest ranking score calculated by (2)
9.          Observe the user feedback after the recommendation
10.         Generate the new state $s_{t+1}$ according to the feedback
11.         Store the transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ in D
12.         Sample $N$ transitions $\langle s_i, a_i, r_i, s_{i+1} \rangle$ from D
13.         Set $y_i = r_i + \gamma Q_{w'}(s_{i+1}, \pi_{\theta'}(s_{i+1}))$
14.         Update the Actor by using the sampled policy gradient
15.         Update the Critic by minimizing the loss in (6)
16.         Update the target networks: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $w' \leftarrow \tau w + (1-\tau)w'$
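Steps 13 and 16 of Algorithm 1 correspond to the standard target computation and soft (Polyak) target update; a minimal sketch, assuming PyTorch networks as in the earlier sketch, is given below.

```python
import torch

def critic_target(reward, next_state, target_actor, target_critic, gamma=0.9):
    """y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1}))  (step 13).
    reward is assumed to be a tensor of shape [N, 1]."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        return reward + gamma * target_critic(next_state, next_action)

def soft_update(target_net, net, tau):
    """theta' <- tau * theta + (1 - tau) * theta'  (step 16)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```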
B. Performance Comparison

TABLE I. PERFORMANCE COMPARISON OF ALL METHODS
(cumulative reward of 10 steps)

Model         ML(100K)    ML(1M)
Baseline_1    1698.1      1579.5
Baseline_2    1785.2      1694.2
Baseline_3    1801.9      1685.3
DRRS          2097.4      1893.2
We compare the performance of the proposed method DRRS with the baseline algorithms. All compared methods can be summarized into the same ranking-score pattern and differ only in how the weighting parameter is obtained. The three traditional ranking baselines are: (1) baseline_1, where the weighting parameter is fixed to 1; (2) baseline_2, where the weighting parameter is set to the real number that maximizes the cumulative revenue on the training set by grid search; (3) baseline_3, which is the same as the second baseline except that the weighting parameter can take a different value for each user. The proposed DRRS applies a DRL method to obtain the weighting parameter.
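As an illustration of baseline_2, the grid search can be sketched as below; the candidate grid and the revenue-evaluation callable are hypothetical, since the paper does not specify them.

```python
def grid_search_baseline(evaluate_cumulative_revenue,
                         candidates=(0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    """baseline_2: pick the fixed weighting parameter with the highest
    cumulative revenue on the training set.

    evaluate_cumulative_revenue: callable mapping a candidate parameter to
    the cumulative revenue it achieves on the training set (hypothetical).
    """
    return max(candidates, key=evaluate_cumulative_revenue)
```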
Table I summarizes the performance of these models on the ML(100K) and ML(1M) datasets. It can be observed that DRRS outperforms all three baselines on both datasets.