International Core Journal of Engineering 2020-26 | Page 128
rate and then decide which cellphone is the optimal choice; 7)
relying on the predicted result, we confirm the specific
cellphone, down to a specific model of a cellphone brand.
$$P_i = G(x_i), \quad i \le T$$
Where T is the count of features of the target consumers
in the testing dataset.
B. Training Data and Prediction Data
The survey data, including users' personal information,
cellphone features and cellphone usage for all users, are
used to train the machine learning models. The data
from people who are willing to switch their cellphone
are used for prediction.
When we predict, the feature for a given cellphone is set to a
fixed value: 1 represents a cellphone that has been used and 0
represents a cellphone that has never been used. The purchase
rate values returned can then be denoted as P(1) and P(0).
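The fixed-value prediction above can be sketched as follows. This is a minimal illustration, not the model used in the paper: the function `G`, its weights, and the feature names `used_this_cellphone` and `plays_mobile_games` are hypothetical stand-ins for a trained model and two of the 148 survey features.

```python
import math


def G(x):
    """Hypothetical stand-in for a trained model: maps features to a purchase rate in [0, 1]."""
    # Made-up weights for illustration only; a real model would be learned from the survey data.
    weights = {"used_this_cellphone": 1.2, "plays_mobile_games": 0.4}
    bias = -0.8
    z = bias + sum(weights[name] * x[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def purchase_rates(consumer, flag="used_this_cellphone"):
    """Return (P(1), P(0)): the predicted rate with the cellphone flag fixed to 1 and to 0."""
    p1 = G({**consumer, flag: 1})  # copy of the consumer with the flag forced to "used"
    p0 = G({**consumer, flag: 0})  # copy of the consumer with the flag forced to "never used"
    return p1, p0


consumer = {"used_this_cellphone": 0, "plays_mobile_games": 1}
p1, p0 = purchase_rates(consumer)
print(f"P(1) = {p1:.3f}, P(0) = {p0:.3f}")
```

Only the fixed feature changes between the two calls, so any difference between P(1) and P(0) is attributable to that feature under the assumed model.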
F. Cellphone Determination
After obtaining consumers' purchase rate for each
cellphone, we can determine the desired cellphones for
consumers. Suppose the predicted purchase rates for all
cellphones are denoted as set P, and the corresponding
cellphones are denoted as set S. To guarantee a low time
complexity, we apply a heapsort algorithm to determine the
suitable cellphones, as shown in Fig. 2. The inputs are set S
and set P. This algorithm assigns appropriate cellphones to
consumers, which answers our research question 4.
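A heap-based ranking of cellphones by purchase rate can be sketched as below. This is an assumption about the general shape of such a step, not a reproduction of the algorithm in Fig. 2; the function name `rank_cellphones` and the sample values in `S` and `P` are hypothetical.

```python
import heapq


def rank_cellphones(S, P):
    """Heapsort cellphones by predicted purchase rate, highest rate first."""
    if len(S) != len(P):
        raise ValueError("S and P must have the same length")
    # heapq implements a min-heap, so rates are negated to pop the largest first.
    # The index is a tie-breaker that keeps the ordering deterministic.
    heap = [(-rate, idx, phone) for idx, (phone, rate) in enumerate(zip(S, P))]
    heapq.heapify(heap)  # O(n) heap construction
    ranked = []
    while heap:
        neg_rate, _, phone = heapq.heappop(heap)  # O(log n) per pop
        ranked.append((phone, -neg_rate))
    return ranked


S = ["model_a", "model_b", "model_c"]  # hypothetical cellphone models
P = [0.35, 0.80, 0.55]                 # hypothetical predicted purchase rates
ranked = rank_cellphones(S, P)
best_phone, best_rate = ranked[0]
print(ranked)
```

The full sort costs O(n log n). If only the single best cellphone is needed, the first pop after `heapify` already yields it.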
We obtain the training data from online questionnaires
and paper questionnaires given to browsers in cellphone stores
and to university students. There are tens of thousands of
items, and each item contains 148 features and 1 label that
specifies whether the consumer purchased a specific
cellphone. The features can be divided into
2 groups: 1) personal information; 2) cellphone information.
These 2 groups are independent of each other.
To label the surveyed training data, we use label 0 to
represent that the specified cellphone was not purchased
and label 1 to represent that it was purchased. By
reviewing the items, we obtain the prediction data from
people who are willing to purchase a new
cellphone. There are a few thousand such consumers.
C. Feature Analysis
Although all the features are closely related to the
labeling, not all of them are of the same importance in
contributing to the labeling. To understand the characteristics
of every feature, we apply the feature importance assessment
using Random Forest.
We list the importance of the top 5 features as follows: 1)
Cellphone prices as a percentage of income-30%~50%:
4.24%; 2) Cellphone use-primary use: 3.90%; 3) Cellphone
use-mobile game: 3.35%; 4) Cellphone demand-Bluetooth:
2.82%; 5) Purchase advice-colleagues: 2.56%. The overall
score of these five features is 16.87%.
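The overall score appears to be the sum of the five listed importances. A short check of that arithmetic, using only the values reported above (the variable names are illustrative):

```python
# Importance values (in percent) copied from the list above.
top_features = {
    "Cellphone prices as a percentage of income-30%~50%": 4.24,
    "Cellphone use-primary use": 3.90,
    "Cellphone use-mobile game": 3.35,
    "Cellphone demand-Bluetooth": 2.82,
    "Purchase advice-colleagues": 2.56,
}

# Rank features from most to least important.
ranked_features = sorted(top_features.items(), key=lambda kv: kv[1], reverse=True)

# Sum the five importances; rounding removes floating-point noise.
overall = round(sum(top_features.values()), 2)
print(overall)
```

Under this reading, the remaining 143 features together account for the rest of the total importance.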
D. Training Model Selection and Parameter Tuning
We investigate two candidate training models, Random
Forest (RF) and Gradient Boosting Decision Tree (GBDT),
and choose the better of the two. We also select two
existing methods for comparison:
Artificial Neural Network (ANN) and Naïve Bayes (NB).
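Choosing the better of two candidates can be sketched as a comparison on held-out data. The criterion is an assumption here: the sketch uses validation accuracy, and the two "models" are toy decision rules rather than trained RF or GBDT models, so the outcome says nothing about how the real models compare.

```python
def accuracy(model, X, y):
    """Fraction of samples for which the model's predicted label matches the true label."""
    correct = sum(1 for xi, yi in zip(X, y) if model(xi) == yi)
    return correct / len(y)


def select_better(candidates, X_val, y_val):
    """Score every candidate on the validation set and return the best name plus all scores."""
    scores = {name: accuracy(model, X_val, y_val) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for two trained candidates (hypothetical rules, not real models).
candidates = {
    "candidate_1": lambda x: int(x[0] > 0.5),
    "candidate_2": lambda x: int(x[0] + x[1] > 1.0),
}

# Hypothetical validation samples with labels 1 (purchased) and 0 (not purchased).
X_val = [(0.9, 0.8), (0.2, 0.1), (0.6, 0.1), (0.4, 0.9)]
y_val = [1, 0, 0, 1]

best, scores = select_better(candidates, X_val, y_val)
print(best, scores)
```

Other metrics, such as AUC or F1, could be substituted for `accuracy` without changing the selection logic.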
Fig. 2. A cellphone determination algorithm
The model training can be represented as:
$$G(x) = g(x_1, x_2, \ldots, x_n), \quad n = 148$$
where $n$ is the number of features, $g(\cdot)$ is the function that
represents the model used for prediction, and $x_1, \ldots, x_n$ denote
the features of each consumer. $G(x)$ represents the
training result. This answers research question 3.
V. EXPERIMENTAL EVALUATION
We implement CIMR and evaluate it experimentally.
In this section, we describe the experimental
environment, the experimental procedure, the experimental
results and the discussion.
A. Experimental Environment
The experimental environment is shown in Fig. 3. Both the
training data and the prediction data are stored in the SQL
server database. The training data and the prediction
data are then passed to the distributed machine learning platform.
E. Data Prediction
We use the trained model to calculate the
purchase rate of each consumer in the testing dataset. The
features of each consumer in the testing dataset are denoted
as $x_i$, and the trained model can be denoted as $G(x_i)$, so the
purchase rate of each consumer is denoted as $P_i = G(x_i)$.
The machine learning model is generated based on the
training data, and the prediction data will be inputs of the
machine learning model for purchase rate prediction. We
employ our cellphone determination engine to figure out the
appropriate models of cellphone that satisfy requirements for