A Novel Method to Improve Hit Rate for Big Data Quick Reading
Xiaobo Zhang, Xinxin Zhou, Zhaohui Zhang, Lizhi Wang, Pengwei Wang
Computer Science and Technology, Donghua University, Shanghai, China
e-mail: [email protected]
Abstract—In big data mining analysis, the data records in a dataset are retrieved randomly. Distributed storage systems, such as BigTable and HBase, provide a cache policy for file blocks during retrieval operations. Since the requested records are scattered across different file blocks, the block cache does not achieve a high hit rate. To deal with this problem, we propose an LRU-based double-queue K-frequency cache method (DLK). The method presents a double-queue storage structure, applying different storage and eviction rules to data with high and low access frequency. In addition, the method divides the memory into a data area and a list area and adopts different data structures for each, which reduces the time of data retrieval and data processing. The experimental results show that the proposed method reduces retrieval time by 30% with the cache mechanism. Compared with existing methods, DLK improves the hit rate by 60.1% and reduces retrieval time by 43.5%. Even with a smaller cache capacity, our method outperforms the other algorithms.
Keywords—Distributed double-queue cache, replacement, frequency

I. INTRODUCTION

In the mining analysis of big data, the data usually exhibit a hot/cold-spot characteristic: 20% of the data are frequently accessed during the retrieval process, while the remaining 80% are accessed far less often. To improve the performance of data retrieval and reduce disk I/O operations, the hot-spot data are generally replicated and stored in memory during data storage, which is called caching. Distributed storage systems such as BigTable [1] and HBase [2] provide a cache for file blocks. For example, HBase uses the local file system or the HDFS of Hadoop [3] as its storage support and caches file blocks in the BlockCache [4] to improve retrieval efficiency.

In big data mining analysis, data retrieval is usually record-oriented, scattered, and random. Caching file blocks in a sequential-read environment can bring a high hit rate and improve query efficiency. However, in a decentralized, random-read environment, the file block-based cache strategy is neither efficient nor suitable.
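To make this mismatch concrete, the following is a minimal, illustrative simulation (our own sketch, not the paper's experiment; the function name and all parameter values are assumptions). When requested records are scattered uniformly across blocks, the hit rate of an LRU cache of whole file blocks stays close to the fraction of blocks it can hold, no matter how recently those blocks were used.

```python
import random

def block_cache_hit_rate(num_blocks, cache_blocks, records_per_block, num_reads):
    """Hit rate of an LRU cache of whole file blocks under random record reads."""
    cache = []                                # LRU order: index 0 is least recent
    hits = 0
    for _ in range(num_reads):
        record = random.randrange(num_blocks * records_per_block)
        block = record // records_per_block   # block containing the requested record
        if block in cache:
            hits += 1
            cache.remove(block)
            cache.append(block)               # refresh to most-recently-used
        else:
            if len(cache) >= cache_blocks:
                cache.pop(0)                  # evict the least-recently-used block
            cache.append(block)
    return hits / num_reads

# Holding 10 of 100 blocks yields only about a 10% hit rate when the
# requested records are scattered uniformly across the blocks.
print(block_cache_hit_rate(100, 10, 64, 100_000))
```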
To solve the above problem, we study and establish a record-oriented cache structure in a distributed storage model, and propose an LRU-based double-queue K-frequency cache method (DLK). Extensive experiments are performed to show the effectiveness of the proposed method.
The contributions of this paper are summarized as follows:
(1) We establish a record-oriented cache structure in a distributed storage model.
(2) A record-oriented cache method is proposed. The method uses a dual LRU queue to distinguish data access frequency, and the cached data are further divided into high-access-frequency data and low-access-frequency data.
(3) The memory is divided into a data area and a list area, and the cached data are stored in the two areas separately. This improves the speed of data retrieval and the efficiency of replacement (see the sketch after this list).
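As a concrete illustration of contributions (2) and (3), here is a minimal Python sketch of a double-queue K-frequency record cache. The class name, the promotion threshold k, and the rule of evicting from the low-frequency queue first are our assumptions; the paper's exact rules are defined in Section 3.

```python
from collections import OrderedDict

class DLKCacheSketch:
    """Minimal sketch of a double-queue K-frequency record cache.

    Assumptions (ours, not the paper's exact rules): a record is promoted
    from the low-frequency queue to the high-frequency queue once it has
    been accessed k times, each queue is maintained in LRU order, and
    eviction prefers the low-frequency queue.
    """

    def __init__(self, capacity, k=2):
        self.capacity = capacity     # total number of records the cache holds
        self.k = k                   # promotion threshold (the "K" in K-frequency)
        self.low = OrderedDict()     # "list area": low-frequency LRU queue, key -> count
        self.high = OrderedDict()    # "list area": high-frequency LRU queue, key -> count
        self.data = {}               # "data area": key -> record payload

    def get(self, key):
        if key in self.high:
            self.high[key] += 1
            self.high.move_to_end(key)          # refresh LRU position
            return self.data[key]
        if key in self.low:
            self.low[key] += 1
            if self.low[key] >= self.k:         # record became hot: promote it
                self.high[key] = self.low.pop(key)
            else:
                self.low.move_to_end(key)
            return self.data[key]
        return None                             # cache miss

    def put(self, key, record):
        if key in self.data:                    # update in place and refresh
            self.data[key] = record
            self.get(key)
            return
        if len(self.data) >= self.capacity:
            victim_queue = self.low if self.low else self.high
            victim, _ = victim_queue.popitem(last=False)   # evict LRU entry
            del self.data[victim]
        self.low[key] = 1                       # new records start in the low queue
        self.data[key] = record
```

Keeping the two queues (list area) separate from the record payloads (data area) means the eviction bookkeeping touches only small index structures rather than the cached records themselves.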
The rest of the paper is organized as follows. In Section 2, related work is discussed. The problem is defined and the DLK method is proposed in Section 3. After that, Section 4 presents the experimental results. Finally, we conclude the paper in Section 5.
II. RELATED WORK
Today, caches are widely applied in operating systems, CPUs, databases, CDNs, and other fields. Many researchers have studied replacement algorithms; the classic ones include LRU, FIFO, and OPT [5], among which LRU is the most commonly used. However, the LRU algorithm has some shortcomings. When the access sequence has a looping pattern whose length is greater than the cache capacity, page jitter may cause a low hit rate. Moreover, LRU considers only the recency of data retrieval rather than its access frequency, which is an obstacle to applying LRU widely.
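The looping pathology is easy to reproduce. Below is a small, self-contained demonstration (our own illustration, not from the paper): a plain LRU cache of capacity 3 serving a cyclic scan over 4 keys evicts, on every access, exactly the key that will be needed next, so it never hits.

```python
from collections import OrderedDict

def lru_hits(capacity, accesses):
    """Count the hits of a plain LRU cache over an access sequence."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)             # refresh LRU position
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)      # evict least recently used
            cache[key] = None
    return hits

# A cyclic scan over 4 keys with capacity 3: every miss evicts the key
# that is needed next, so plain LRU never hits.
print(lru_hits(3, [0, 1, 2, 3] * 10))  # -> 0
```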
Based on the LRU algorithm, the LRU-K [6] algorithm adds a retrieval-history sequence, in which the sequence doesn't