International Core Journal of Engineering 2020-26 | Page 128

rate and then decide which cellphone is the optimal choice; 7) relying on the predicted result, we narrow the choice down to a specific model of a cellphone brand.

B. Training Data and Prediction Data

The survey data, which cover the personal information, cellphone features, and cellphone usage of all users, are used to train the machine learning models. The data from people who are willing to switch cellphones are used for prediction.

We obtain the training data from online and paper questionnaires given to cellphone store browsers and university students. There are tens of thousands of items, and each item contains 148 features and 1 label that specifies whether the consumer purchased the specified cellphone. The features can be divided into 2 groups: 1) personal information; 2) cellphone information. These 2 groups are independent of each other. In the surveyed training data, label 0 means the specified cellphone was not purchased and label 1 means it was purchased.

By reviewing the items, we obtain the prediction data from people who are willing to purchase a new cellphone. The number of such consumers is a few thousand.

C. Feature Analysis

Although all the features are closely related to the label, they do not contribute to it equally. To understand the characteristics of every feature, we assess feature importance using Random Forest. The top 5 features by importance are: 1) Cellphone price as a percentage of income, 30%~50%: 4.24%; 2) Cellphone use, primary use: 3.90%; 3) Cellphone use, mobile game: 3.35%; 4) Cellphone demand, Bluetooth: 2.82%; 5) Purchase advice, colleagues: 2.56%. Together, these five features account for 16.87% of the total importance.

D. Training Model Selection and Parameter Tuning

We investigate two candidate training models, Random Forest (RF) and Gradient Boosting Decision Tree (GBDT), and choose the better of the two. We also select two existing methods for comparison: Artificial Neural Network (ANN) and Naïve Bayes (NB).

The model training can be represented as:

$$G(x) = g(x_1, x_2, \ldots, x_n), \quad n = 148$$

where n is the number of features, g(.) is the function that represents the model used for prediction, and x_1, ..., x_n denote the features of each consumer. G(x) represents the training result. This answers research question 3.

E. Data Prediction

We use the trained model to calculate the purchase rate of each consumer in the testing dataset. The features of each consumer in the testing dataset are denoted as x_i, and the trained model applied to them is denoted as G(x_i), so the purchase rate of consumer i is:

$$P_i = G(x_i), \quad i \in \{1, \ldots, T\}$$

where T is the count of feature vectors of the target consumers in the testing dataset.

When we predict, the feature for a given cellphone is set to a fixed value: 1 represents a cellphone that has been used and 0 represents a cellphone that has never been used. The purchase rate values returned can then be denoted as P(1) and P(0).

F. Cellphone Determination

After obtaining the consumers' purchase rate for each cellphone, we can determine the desired cellphones for consumers. Suppose the predicted purchase rates for all cellphones are denoted as set P, and the corresponding cellphones are denoted as set S. To guarantee a low time complexity, we apply a heapsort algorithm to determine the suitable cellphones, as shown in Fig. 2. The inputs are set S and set P. This algorithm assigns appropriate cellphones to consumers, which answers our research question 4.

Fig. 2. A cellphone determination algorithm

V. EXPERIMENTAL EVALUATION

We implement CIMR and evaluate it experimentally. In this section, we describe the experimental environment, the experimental procedure, the experimental results, and the discussion.

A. Experimental Environment

The experimental environment is shown in Fig. 3. Both the training data and the prediction data are stored in the SQL server database. They are then passed to the distributed machine learning platform. The machine learning model is generated from the training data, and the prediction data are the inputs of the machine learning model for purchase rate prediction. We employ our cellphone determination engine to figure out the appropriate models of cellphone that satisfy requirements for
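As an aside on the cellphone determination step of Section IV.F, the selection over sets S and P can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 2: it assumes S and P are parallel lists, and it uses heap-based top-k selection from Python's heapq module rather than a full heapsort. The function name top_cellphones, the parameter k, and the sample data are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def top_cellphones(
    S: Sequence[str], P: Sequence[float], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k cellphones with the highest predicted purchase rate.

    Assumption: S and P are parallel sequences, so P[j] is the predicted
    purchase rate of cellphone S[j]. The name and the parameter k are
    illustrative and do not come from the paper.
    """
    if len(S) != len(P):
        raise ValueError("S and P must have the same length")
    # heapq.nlargest maintains a heap of at most k entries, so selection
    # costs O(n log k) rather than the O(n log n) of sorting everything.
    ranked = heapq.nlargest(k, zip(P, S), key=lambda pair: pair[0])
    # Return results as (cellphone, rate), highest rate first.
    return [(phone, rate) for rate, phone in ranked]


# Hypothetical purchase rates predicted for one consumer.
phones = ["brand_a_model_1", "brand_b_model_2", "brand_c_model_3"]
rates = [0.12, 0.47, 0.31]
best = top_cellphones(phones, rates, k=2)
```

With k = 1 the call reduces to picking the single cellphone with the highest predicted purchase rate; a larger k yields a short ranked list for the same consumer.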