Fig. 2. User-agent interaction in MDP

B. Proposed Framework

As mentioned in Section I, traditional ranking strategies only consider the immediate reward. We propose a Deep Reinforcement learning based Ranking Strategy (DRRS) that learns the parameter of the ranking function so as to maximize the cumulative reward of the platform. After obtaining α from the Actor network, the ranking score of each item can be calculated, and the platform recommends the items with the top ranking scores to the current user.

Fig. 1. Network structure of DRRS

The network structure of DRRS is illustrated in Fig. 1. It can be observed that the proposed framework consists of three key ingredients: the Actor network, the Ranking part and the Critic network. Next, we present the details of these three ingredients.

1) Actor Network: The network structure of the Actor is shown in the left part of Fig. 1. The goal of the Actor is to generate an action based on the given state. For DRRS, the parameter α is generated and later applied in the Ranking part, as shown in Equation (1):

\alpha_t = f_{\theta^{\pi}}(s_t)    (1)

where s_t denotes the state representation of the incoming user at time t, θ^π represents the parameters of the Actor network and f_{θ^π}(·) stands for the network function of the Actor. The output parameter α_t ∈ ℝ^{1×k} is a k-dimensional real-valued vector. Besides, the widely used ε-greedy [16] exploration is applied by adding Gaussian noise to α_t.
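To make Equation (1) concrete, the sketch below shows an Actor that maps a user state vector to the k-dimensional parameter α_t, together with the Gaussian-noise exploration described above. It is a minimal sketch in PyTorch: the hidden-layer size, the tanh output activation and the noise scale sigma are illustrative assumptions, as the exact architecture is not specified here.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the user state s_t to the ranking parameter alpha_t (Equation (1))."""
    def __init__(self, state_dim: int, k: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, k), nn.Tanh(),  # bounded k-dimensional output (assumption)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def explore(actor: Actor, state: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Exploration: add Gaussian noise to alpha_t, as described in the text."""
    with torch.no_grad():
        alpha = actor(state)
    return alpha + sigma * torch.randn_like(alpha)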
2) Ranking part: We first define a new ranking function f_rank(α_t, e_i) to determine which items are recommended to the users. In this paper, the new ranking function is defined as Equation (2):

\text{score}_i = f_{\text{rank}}(\alpha_t, e_i)    (2)

where e_i denotes the embedding of the i-th item pre-trained by Probabilistic Matrix Factorization (PMF) [17], and the dimension of α_t is equal to the dimension of the item embedding. In a recommendation session, the ranking score of each item is calculated according to Equation (2), and the items with the top ranking scores are recommended to the user.

The parameter α learned in the Actor network is applied to define the score of each candidate item as the inner product with its embedding, shown in Equation (3):

\text{score}_i = \alpha_t \, e_i^{\top}    (3)
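Under the inner-product form reconstructed in Equations (2)-(3), the Ranking part reduces to a matrix-vector product followed by a top-N cut-off. The sketch below assumes the PMF-pretrained embeddings are stacked in one tensor; the names and the cut-off top_n are illustrative.

import torch

def rank_items(alpha: torch.Tensor, item_embeddings: torch.Tensor, top_n: int = 10) -> torch.Tensor:
    """Score every candidate item and return the indices of the top-N items.

    alpha:            shape (k,), parameter produced by the Actor
    item_embeddings:  shape (M, k), PMF-pretrained embeddings e_i
    """
    scores = item_embeddings @ alpha          # score_i = alpha · e_i, Equation (3)
    return torch.topk(scores, top_n).indices  # items recommended to the user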
3) Critic Network: The network structure of the Critic, shown in the right part of Fig. 1, can be regarded as a Deep Q-network. The state-action value Q(s, a) is learned to estimate the cumulative reward for the given state and action. The parameters of the Actor are updated in the direction that improves the Q value, namely towards selecting the best action for the given state. According to the deterministic policy gradient theorem [18], the Actor network is updated by the sampled policy gradient shown in Equation (4):

\nabla_{\theta^{\pi}} J \approx \frac{1}{N} \sum_{t} \nabla_{a} Q(s, a; \theta^{Q}) \big|_{s=s_t,\, a=f_{\theta^{\pi}}(s_t)} \, \nabla_{\theta^{\pi}} f_{\theta^{\pi}}(s) \big|_{s=s_t}    (4)

We follow the standard assumption that a delayed reward should be discounted by a factor γ for each time step. The target action-value function should therefore follow the Bellman equation [19], as in Equation (5):

y_t = \mathbb{E}\big[ r_t + \gamma \, Q(s_{t+1}, f_{\theta^{\pi}}(s_{t+1}); \theta^{Q}) \mid s_t, a_t \big]    (5)

A deep neural network is applied to learn the non-linear relationships of the action-value function. The goal of the Critic network is to minimize the mean squared error between its output value and the target value calculated by the Bellman equation, as in Equation (6):

L(\theta^{Q}) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}}\big[ \big( y_t - Q(s_t, a_t; \theta^{Q}) \big)^{2} \big]    (6)

where θ^Q represents the parameters of the Critic network and y_t denotes the target state-action value for this iteration.
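The sketch below shows how Equations (4)-(6) translate into one Actor-Critic training step. Target networks (target_actor, target_critic) are assumed here in line with common DDPG practice, although the text on this page does not state whether DRRS uses them; the layer sizes and the discount factor are likewise illustrative.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Estimates the state-action value Q(s, a; theta_Q)."""
    def __init__(self, state_dim: int, k: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + k, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def train_step(actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG-style update: Equations (5)-(6) for the Critic, Equation (4) for the Actor."""
    s, a, r, s_next = batch  # tensors sampled from the logged transitions
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))  # Bellman target, Equation (5)
    critic_loss = ((y - critic(s, a)) ** 2).mean()                   # MSE loss, Equation (6)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()  # ascending Q realises the sampled gradient of Equation (4)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()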
C. Environment Simulator

In a commercial system, the exploration part of DRL may cause unstable model performance, which can result in a loss of millions of dollars. Therefore, training the model on an offline log file is an ideal solution.

To train the proposed model offline, we need to build a simulator of the online environment to generate users' feedback. When the platform recommends an item to a user, if the record exists in the interaction log file, the reward is set to 1 when the feedback is positive and to 0 otherwise. If the record is not in the interaction log file, we use the statistical popularity-based information of each item to predict the user's feedback. To summarize, the environment simulator can be defined as Equation (7).
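One possible reading of the simulator described above is sketched below. The log lookup returns a reward of 1 for a recorded positive feedback and 0 for a recorded negative one; for user-item pairs missing from the log, the popularity-based fallback shown here (sampling by an empirical click-through rate) is only an assumption, since Equation (7) itself is not reproduced on this page.

import random

def simulate_feedback(user_id, item_id, interaction_log, item_popularity):
    """Offline environment simulator sketch.

    interaction_log:  dict mapping (user_id, item_id) -> True/False feedback
    item_popularity:  dict mapping item_id -> empirical click-through rate
    """
    key = (user_id, item_id)
    if key in interaction_log:                 # record found in the log file
        return 1 if interaction_log[key] else 0
    # Record missing: fall back to popularity statistics (stands in for Equation (7)).
    return 1 if random.random() < item_popularity.get(item_id, 0.0) else 0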