Securing the ML Lifecycle
We can already observe that several stakeholders are interacting here, though we left it open who trains the model and how the raw supplier data (images and technical documentation) is transferred into the ML pipeline.
A hospital is using a pre-trained commercial model to analyze medical images on its network. However, the hospital also wants to use its own databases to improve the model. Data from the radio-diagnostics department is sent to a cloud platform where a commercial provider has set up its training infrastructure. Models are retrained and updated on a regular basis.
Again, at least two stakeholders are involved: the actual end user of the model and the company providing the initial model. The case again leaves it open how the existing model is retrained and how the required data is transferred.
As with any software solution, training an ML model should serve a defined business objective and the non-functional requirements derived from it. Only if these are clearly defined can the required training data be gathered and the correct training parameters be set.
When collecting data, the two core parameters are quantity and quality. Quality specifically includes monitoring the balance of the data to avoid a later bias (e.g., in medical data⁷). Another concern is where the data is generated. One possibility is that this is done within the full control of the organization that defines the training process. However, the data could also be obtained from public repositories or through contractual agreements with third parties or customers.
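Monitoring the balance of collected data can start with something as simple as counting label shares. The following is a minimal sketch, not a prescribed method: the function names, the `min_share` threshold of 0.2, and the example labels are all hypothetical and would need to be chosen to fit the actual business objective.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each label in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.2):
    """List labels whose share falls below min_share (hypothetical threshold)."""
    shares = class_balance(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical labels for a small medical image dataset
labels = ["benign"] * 9 + ["malignant"] * 1
shares = class_balance(labels)
flagged = underrepresented(labels)
```

A check like this only covers label balance. Balance across other attributes (for example, patient groups or data sources) would need the same kind of count on those columns.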
In an initial data analysis, the collected data is examined with respect to its structure, data types and categories, possible outliers, or possible immediate correlations. This is then usually followed by a preprocessing step to, for example, remove any statistical noise or zero values. The features (column names) used for the later training are identified, and the data is split into training, validation, and test sets.
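The preprocessing and splitting steps described above can be sketched as follows. This is an illustrative example under stated assumptions: the column names (`width_mm`, `weight_g`, `defect`), the 70/15/15 split, and the fixed seed are hypothetical, and real pipelines typically rely on a data-processing library rather than plain dictionaries.

```python
import random


def preprocess(rows, features, label):
    """Keep the selected feature columns plus the label.

    Rows with a missing or zero feature value are dropped.
    """
    cleaned = []
    for row in rows:
        values = [row.get(name) for name in features]
        if any(v is None or v == 0 for v in values):
            continue
        record = {name: row[name] for name in features}
        record[label] = row[label]
        cleaned.append(record)
    return cleaned


def split(rows, train_pct=70, val_pct=15, seed=42):
    """Shuffle reproducibly and split into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical records: 20 valid rows plus two that preprocessing should drop
rows = [
    {"id": i, "width_mm": 10.0 + i, "weight_g": 5.0 + i, "defect": i % 2}
    for i in range(20)
]
rows.append({"id": 20, "width_mm": 0, "weight_g": 3.0, "defect": 0})
rows.append({"id": 21, "width_mm": 12.0, "weight_g": None, "defect": 1})

cleaned = preprocess(rows, features=["width_mm", "weight_g"], label="defect")
train_set, val_set, test_set = split(cleaned)
```

Fixing the seed makes the split reproducible, which helps when a later audit needs to establish exactly which records a model was trained on.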
The training phase usually starts with a reasoned choice (configuration) of suitable training algorithms and associated parameters (e.g., number of hidden layers, batch size, or learning rate). “Suitable” in this context should imply that the training approach fits the defined business objective. The training data is then fed into this defined training pipeline, resulting in incrementally validated and adjusted training parameters and intermediate models.
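To make the idea of an explicit training configuration and incremental validation concrete, here is a minimal sketch. It assumes a toy linear model trained with mini-batch gradient descent on synthetic data, so there are no hidden layers to configure. The `TrainingConfig` class, its default values, and the data are hypothetical and only illustrate the pattern: fix the parameters up front, validate after each epoch, and keep the best intermediate model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training parameters, fixed before training starts."""
    learning_rate: float = 0.1
    batch_size: int = 4
    epochs: int = 200
    seed: int = 0


def mse(params, data):
    """Mean squared error of the linear model y = w * x + b."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(config, train_data, val_data):
    """Mini-batch gradient descent with validation after every epoch."""
    rng = random.Random(config.seed)
    w, b = 0.0, 0.0
    best_params, best_val = (w, b), mse((w, b), val_data)
    history = []
    for _ in range(config.epochs):
        shuffled = list(train_data)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), config.batch_size):
            batch = shuffled[start:start + config.batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= config.learning_rate * grad_w
            b -= config.learning_rate * grad_b
        # Validate the intermediate model and keep the best one seen so far
        val_loss = mse((w, b), val_data)
        history.append(val_loss)
        if val_loss < best_val:
            best_params, best_val = (w, b), val_loss
    return best_params, history


# Synthetic, noise-free data following y = 2x + 1
points = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
train_data, val_data = points[:16], points[16:]
(w, b), history = train(TrainingConfig(), train_data, val_data)
```

Keeping the configuration in a single immutable object makes it straightforward to log or sign alongside the resulting model, so that the parameters behind each intermediate model remain traceable.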
⁷ Vokinger, K. N., Feuerriegel, S., & Kesselheim, A. S. “Mitigating bias in machine learning for medicine.” Commun Med 1, 25 (2021).
40 March 2020