
2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM)

A Novel Method to Improve Hit Rate for Big Data Quick Reading

Xiaobo Zhang, Xinxin Zhou, Zhaohui Zhang, Lizhi Wang, Pengwei Wang
Computer Science and Technology, Donghua University, Shanghai, China
e-mail: [email protected]

Abstract—In big data mining analysis, the data records in a dataset are retrieved at random. Distributed storage systems such as BigTable and HBase provide a cache policy for file blocks during retrieval. Since the requested records are scattered across different file blocks, the block cache does not achieve a high hit rate. To deal with this problem, we propose an LRU-based double-queue K-frequency cache method (DLK). The method uses a double-queue storage structure, applying different storage and eviction rules to data with different access frequencies (i.e., high and low access frequency). In addition, the method divides memory into a data area and a list area and adopts different data structures in each to reduce the time spent on data retrieval and data processing. The experimental results show that the proposed method reduces retrieval time by 30% through its cache mechanism. Compared with existing methods, DLK improves the hit rate by 60.1% and reduces retrieval time by 43.5%. Moreover, at smaller cache capacities, our method still outperforms other algorithms.

Keywords—Distributed double-queue cache, replacement, frequency, strategy

I. INTRODUCTION

In the mining analysis of big data, the data usually exhibits hot and cold spots: 20% of the data is frequently accessed during retrieval, while the remaining 80% is accessed far less often. To improve retrieval performance and reduce disk I/O operations, the hot data is generally replicated and kept in memory during data storage, which is called caching. Distributed storage systems such as BigTable [1] and HBase [2] provide a cache for file blocks. For example, HBase uses the local file system or the HDFS of Hadoop [3] as its storage support, and caches file blocks in the BlockCache [4] to improve retrieval efficiency.

In big data mining analysis, however, data retrieval is usually record-oriented, scattered, and random. Caching file blocks in a sequential-read environment can yield a high hit rate and improve query efficiency, but in a scattered, random-read environment a file-block-based cache strategy is neither efficient nor suitable.

To solve the above problem, we study and establish a record-oriented cache structure in a distributed storage model, and propose an LRU-based double-queue K-frequency cache method (DLK). Extensive experiments are performed to show the effectiveness of the proposed method. The contributions of this paper are summarized as follows:

(1) We establish a record-oriented cache structure in a distributed storage model.

(2) A record-oriented cache method is proposed. The method uses a dual LRU queue to distinguish data by access frequency, dividing the cached data into high-access-frequency data and low-access-frequency data (a minimal sketch of this idea follows at the end of this section).

(3) Memory is divided into a data area and a list area, and the cached data are stored in the respective areas. This improves the speed of data retrieval and the efficiency of replacement.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 defines the problem and proposes the DLK method. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.
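To make the dual-queue idea in contribution (2) concrete, the following is a minimal Java sketch of a two-queue, K-frequency cache. The class name DoubleQueueCache, the promotion threshold, and the per-queue capacities are illustrative assumptions, not the paper's actual implementation; DLK's real data structures (the data area and list area) are described in Section 3.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: two LRU queues, one for cold (low-frequency) data and
// one for hot (high-frequency) data. A record is promoted to the hot queue
// after promoteK accesses. Names and structure are assumptions, not DLK itself.
public class DoubleQueueCache<K, V> {
    private final int promoteK;               // accesses needed to count as "hot"
    private final Map<K, V> lowQueue;          // low-frequency data, LRU-ordered
    private final Map<K, V> highQueue;         // high-frequency data, LRU-ordered
    private final Map<K, Integer> hits = new HashMap<>(); // per-key access counts

    public DoubleQueueCache(int lowCapacity, int highCapacity, int promoteK) {
        this.promoteK = promoteK;
        // accessOrder=true makes LinkedHashMap keep LRU order; overriding
        // removeEldestEntry evicts the least recently used entry on overflow.
        this.lowQueue = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > lowCapacity;
            }
        };
        this.highQueue = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > highCapacity;
            }
        };
    }

    public V get(K key) {
        V value = highQueue.get(key);          // hot data is checked first
        if (value != null) {
            return value;
        }
        value = lowQueue.get(key);
        if (value == null) {
            return null;                       // miss: caller loads from disk
        }
        int count = hits.merge(key, 1, Integer::sum);
        if (count >= promoteK) {               // promote after promoteK accesses
            lowQueue.remove(key);
            highQueue.put(key, value);
            hits.remove(key);
        }
        return value;
    }

    public void put(K key, V value) {
        lowQueue.put(key, value);              // new records start as "cold"
        hits.put(key, 1);
        // Simplification: counters for entries evicted from lowQueue are not
        // cleaned up here; a real implementation would handle that.
    }
}
```

The point of the separation is that a burst of one-time reads can evict entries only from the low-frequency queue, leaving frequently used records in the hot queue untouched.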
II. RELATED WORK

Caches are now widely applied in operating systems, CPUs, databases, CDNs, and other fields. Many researchers have studied replacement algorithms; classic ones include LRU, FIFO, and OPT [5], among which LRU is the most commonly used. However, the LRU algorithm has shortcomings. When the access sequence loops and the access interval is greater than the cache capacity, page jitter can sharply lower the hit ratio. Moreover, LRU considers only the recency of data retrieval rather than access frequency, which limits its wider application.

Based on the LRU algorithm, the LRU-K [6] algorithm adds a retrieval-history sequence, in which the sequence does not cache the data itself but records its access history; a record is moved into the cache queue only after it has been accessed K times.
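For reference, the classic LRU policy discussed above can be written in a few lines of Java using LinkedHashMap's access-order mode. This is an illustrative sketch (the class name and demo loop are ours, not from the paper); its main method reproduces the looping weakness described above: cycling through 4 keys with a 3-entry cache evicts, on every access, exactly the key that is needed next.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the classic LRU replacement policy.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);    // accessOrder=true: get() moves entry to tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the least recently used on overflow
    }

    public static void main(String[] args) {
        // Looping access over 4 keys with capacity 3: the access interval
        // exceeds the cache capacity, so the hit rate drops to zero.
        LruCache<Integer, String> cache = new LruCache<>(3);
        for (int round = 0; round < 2; round++) {
            for (int key = 0; key < 4; key++) {
                if (cache.get(key) == null) {
                    cache.put(key, "block-" + key);  // miss: load and cache
                }
            }
        }
    }
}
```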