• A Deep Reinforcement Learning based Ranking Strategy (DRRS) is proposed to maximize the cumulative reward of the platform (see the sketch after this list). The ranking function is redefined as f(b_i, p_i; ω) and DDPG is utilized to learn ω.
• A recommender system simulator is built to obtain immediate feedback without hurting the performance of the commercial platform.
• Experiments conducted on real-world datasets further demonstrate that the proposed method performs much better than traditional ranking algorithms.
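The cumulative reward maximized here is the usual discounted return from RL, using the discount factor γ defined in Section III. The excerpt states this goal only in words, so the short helper below is an explicit restatement for clarity, not the paper's code.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative reward: sum over t of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two clicks (reward 1) at steps 1 and 3 of a four-step session:
discounted_return([0.0, 1.0, 0.0, 1.0])  # 0.9 + 0.729 = 1.629
```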
The rest of this paper is organized as follows. Related work is illustrated in Section II. The proposed method DRRS is introduced in Section III. After that, the experiment results are shown in Section IV. Section V concludes this paper and proposes future work.
II. RELATED WORK
The auction mechanism workflow in a recommender system can be described as follows: each advertiser provides a bid price to the platform for an advertisement response (e.g., a click/download in the App Store). The platform then orders the advertisements by a pre-defined ranking function and presents the advertisements with the top ranking scores to the users. Finally, the platform charges the advertisers the bid price for the positive responses of the users. In our work, the generalized first price (GFP) mechanism is applied for charging, and the goal is to maximize the long-term revenue of the platform. Traditional ranking strategies only consider the immediate reward and apply greedy ranking strategies.
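To make this workflow concrete, here is a minimal sketch of the greedy scheme just described. The names (Ad, serve_and_charge, k) are illustrative, not from the paper; the ranking score bid × pCTR and the pay-your-bid GFP charge follow the description above.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    bid: float    # advertiser's bid price for a positive response
    pctr: float   # predicted click-through rate

def serve_and_charge(ads, clicked, k=3):
    """Rank ads by the greedy score bid * pctr, present the top k,
    and charge each clicked advertiser its own bid (GFP)."""
    ranked = sorted(range(len(ads)),
                    key=lambda i: ads[i].bid * ads[i].pctr, reverse=True)
    shown = ranked[:k]
    revenue = sum(ads[i].bid for i in shown if i in clicked)
    return shown, revenue

# Toy usage: three ads, the user clicks the first ad shown.
ads = [Ad(1.0, 0.10), Ad(0.5, 0.30), Ad(2.0, 0.02)]
shown, revenue = serve_and_charge(ads, clicked={1}, k=2)  # shown == [1, 0]
```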
Reinforcement learning (RL) is usually modeled as a Markov Decision Process (MDP) [8] and is concerned with how agents ought to take actions in an environment so as to maximize some pre-defined cumulative reward. Since simple RL methods (Q-learning, SARSA) cannot handle tasks with continuous state or action spaces, deep reinforcement learning (DRL) was proposed. DRL extends reinforcement learning to the entire pipeline from high-dimensional observations to continuous actions by modeling it with a deep network, without explicitly designing the state or action space. DRL has been applied successfully to various problems, including the game of Go [10], robot control [11], video games [12], etc. As an important DRL model, the DDPG algorithm used in this paper is an Actor-Critic algorithm: the Actor generates an action from the given state, and the Critic approximates the optimal state-action value function.
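For concreteness, the following is a minimal PyTorch sketch of this Actor-Critic structure. The layer sizes and names (STATE_DIM, ACTION_DIM, HIDDEN) are illustrative assumptions; the paper's actual architecture is not specified in this excerpt.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 64, 16, 128  # illustrative sizes only

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),  # bounded action vector
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Approximates the state-action value function Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

In standard DDPG training, the Critic is fit by temporal-difference learning while the Actor is updated by ascending the Critic's gradient with respect to the action; target networks and a replay buffer stabilize the process.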
However, most existing DRL methods for recommendation [13, 14] assume that the platform is non-profitable, meaning the reward of these models does not contain the revenue of the platform; they only consider the ranking quality. Reference [15] does consider the platform revenue, but its central idea is to find a ranking function which simultaneously satisfies the platform, the users and the advertisers, rather than to maximize the platform revenue.
is defined as the user’s click
x State : The state
history sorted in chronological order before time t.
x Action α : The action α is defined as a continuous
vector. The ω involved in the ranking function is
defined as the inner product of α and the item
embedding .
x Reward : After the platform receives the state and
selects an action to recommend items to users, the
reward is defined as the feedback provided by the
users, e.g., click/not click.
x Transition : As the state is defined as the click
history of the user, when the feedback is collected, the
next state is determined.
x Discount factor γ: γ ∈ [0,1] controls the importance
of future rewards. For example, if γ = 0, the model
only considers the immediate reward. When γ = 1, the
model considers the future reward and immediate
reward are equally important.
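As a concrete illustration, the sketch below computes ω_i as the inner product of the action α and each item embedding e_i, as defined above, and scores candidate items. The excerpt gives the ranking function only abstractly as f(b_i, p_i; ω), so the specific form used here (bid times pCTR raised to the learned power) is an assumption borrowed from common revenue-aware ranking practice, not the paper's stated formula.

```python
import numpy as np

def rank_items(alpha, item_embeddings, bids, pctrs):
    """Score candidates given the action alpha; highest score first.

    omega_i = <alpha, e_i> per the Action definition above; the score
    b_i * p_i ** omega_i is an ASSUMED instantiation of f(b_i, p_i; omega).
    """
    omegas = item_embeddings @ alpha           # one omega per item
    scores = bids * np.power(pctrs, omegas)    # assumed form of f
    return np.argsort(-scores)                 # item indices, best first

# Toy usage with random candidates.
rng = np.random.default_rng(0)
order = rank_items(rng.normal(size=8),
                   rng.normal(size=(100, 8)),    # 100 item embeddings
                   rng.uniform(0.1, 2.0, 100),   # bids
                   rng.uniform(0.01, 0.3, 100))  # predicted CTRs
```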
The user-agent interaction of the recommender system is shown in Fig. 2. The recommendation task can thus be stated as: given the recommender-user interaction history in MDP form, find a policy producing α that maximizes the cumulative reward of the recommender system.
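A minimal sketch of this interaction loop follows, assuming the Actor from the earlier sketch and a hypothetical RecEnv simulator whose reset() returns an initial state and whose step() returns (next_state, reward, done); none of these names come from the paper.

```python
def run_episode(env, actor, gamma=0.9, max_steps=50):
    """One simulated user session: the agent picks continuous actions,
    the environment returns click feedback as the reward."""
    state = env.reset()                 # initial click history (the state)
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = actor(state)           # continuous action vector alpha
        state, reward, done = env.step(action)  # clicks extend the history
        ret += discount * reward        # discounted cumulative reward
        discount *= gamma
        if done:
            break
    return ret
```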