Each candidate model created in Activity 5 shall be evaluated using the internal test data ([H]) to check that it satisfies the ML safety requirements. The internal test data shall not have been used during Activity 5 in creating the candidate model. Exposing the internal test data to the development process is known as data leakage in machine learning [5]; it undermines the value of the test results as evidence that the model generalises.
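One simple check that supports this separation is to confirm that no internal test sample also appears in the development data. The following is a minimal sketch, assuming each sample is available as raw bytes; the function names and the sample values are hypothetical. It detects only exact duplicates, which is one form of leakage among several, so passing it does not show that leakage is absent.

```python
import hashlib


def fingerprint(sample: bytes) -> str:
    """Return a content hash so identical samples match even if file names differ."""
    return hashlib.sha256(sample).hexdigest()


def find_leaked_samples(development_data, internal_test_data):
    """Return indices of internal test samples whose content also occurs in the development data."""
    development_hashes = {fingerprint(sample) for sample in development_data}
    return [
        index
        for index, sample in enumerate(internal_test_data)
        if fingerprint(sample) in development_hashes
    ]


# Made-up stand-ins for image contents; the second test sample is a duplicate.
development_data = [b"image-001", b"image-002", b"image-003"]
internal_test_data = [b"image-101", b"image-002", b"image-102"]

leaked = find_leaked_samples(development_data, internal_test_data)
print("Leaked internal test sample indices:", leaked)
```

Any index reported would need to be investigated, and the affected samples removed from one of the two data sets, before the internal test results are relied upon.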
As shown in Figure 10 in Activity 10, the model development stage is iterative: the model creation and model testing activities may be performed many times, creating different models that are evaluated in order to find the best one. If it is not possible to create a model that meets the ML safety requirements on the internal test data, the data management stage (Stage 3) and/or the ML requirements stage (Stage 2) shall be revisited in order to create an acceptable model. Unlike in traditional software testing, it is challenging to understand how an ML model can be changed to solve problems encountered during testing. The model development log ([U]) may provide insights that help the developer to improve the model.
In testing the model we find that the accuracy is lower than expected, indicating that the model fails to generalise beyond the development data. An analysis of the incorrectly classified images shows that images with bright sunlight have a higher failure rate than other images in the test set. This suggests that we should return to the data management stage and collect additional images for this mode of failure.
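The kind of analysis described in this example can be sketched as a failure rate per recorded condition. This is an illustrative sketch only: it assumes each test image carries a condition label (here the hypothetical labels "bright_sunlight" and "overcast") and a flag recording whether it was classified correctly, and the records are made up.

```python
from collections import defaultdict


def failure_rate_by_condition(records):
    """Return the fraction of misclassified samples for each condition label.

    records: iterable of (condition, correct) pairs, where correct is a bool.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for condition, correct in records:
        totals[condition] += 1
        if not correct:
            failures[condition] += 1
    return {condition: failures[condition] / totals[condition] for condition in totals}


# Made-up internal test results, for illustration only.
records = [
    ("bright_sunlight", False),
    ("bright_sunlight", False),
    ("bright_sunlight", True),
    ("bright_sunlight", False),
    ("overcast", True),
    ("overcast", True),
    ("overcast", False),
    ("overcast", True),
]

rates = failure_rate_by_condition(records)
for condition, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{condition}: failure rate {rate:.2f}")
```

A condition whose failure rate is markedly higher than the others is a candidate for further data collection, subject to there being enough samples for the difference to be meaningful.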
The results of the internal testing of the model shall be explicitly documented ([X]).
A model shall be selected from the valid candidate models that have been created. The selected model ([M]) shall be the one that best meets the different, potentially conflicting, requirements that exist. This is a multi‐objective optimisation problem in which there may be multiple models on the Pareto front, and it is important to select the best threshold to satisfy the requirements.
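To illustrate what it means for several models to lie on the Pareto front, the sketch below first discards candidates that fail any minimum requirement and then keeps those that no other valid candidate dominates. It is a minimal sketch under stated assumptions: every objective is an accuracy to be maximised, and the model names, accuracies and minimums are hypothetical.

```python
def meets_requirements(accuracies, minimums):
    """True if every objective reaches its required minimum."""
    return all(value >= minimum for value, minimum in zip(accuracies, minimums))


def dominates(first, second):
    """True if first is at least as good on every objective and strictly better on one."""
    at_least_as_good = all(a >= b for a, b in zip(first, second))
    strictly_better = any(a > b for a, b in zip(first, second))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, accuracies in candidates.items()
        if not any(
            dominates(other, accuracies)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical accuracies on two objectives, for illustration only.
minimums = (0.90, 0.90)
candidates = {
    "model_a": (0.97, 0.91),
    "model_b": (0.92, 0.96),
    "model_c": (0.91, 0.92),  # valid, but dominated by model_b
    "model_d": (0.99, 0.85),  # fails the second minimum
}

valid = {
    name: accuracies
    for name, accuracies in candidates.items()
    if meets_requirements(accuracies, minimums)
}
front = pareto_front(valid)
print("Valid candidates:", sorted(valid))
print("Pareto front:", front)
```

The Pareto front narrows the choice but does not make it: choosing between the remaining models still requires a judgement about which requirements matter most.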
A model to be deployed in a perception pipeline classifies objects into one of ten classes. A set of ML safety requirements is defined in terms of the minimum accuracy for each class. The model development process returns five models, each of which has accuracy greater than this minimum for every class; however, each performs better than the others with respect to one particular class. Under such conditions, choosing the ‘best’ model requires the user to make a trade‐off between class accuracies. Furthermore, this trade‐off may change as we move from rural to urban contexts.
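One way to make such a context‐dependent trade‐off explicit is to weight each class accuracy by its importance in the operating context and compare the weighted scores. The sketch below is an assumption for illustration, not a prescribed selection method: it uses two hypothetical classes rather than ten, and the weights and accuracies are made up.

```python
def weighted_score(accuracies, weights):
    """Return the weighted mean of per-class accuracies, normalised by the total weight."""
    total_weight = sum(weights.values())
    return sum(accuracies[name] * weight for name, weight in weights.items()) / total_weight


def select_model(candidates, weights):
    """Return the name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


# Hypothetical per-class accuracies; both models are assumed to meet every minimum.
candidates = {
    "model_a": {"pedestrian": 0.97, "animal": 0.91},
    "model_b": {"pedestrian": 0.92, "animal": 0.96},
}

# Hypothetical importance weights for each operating context.
urban_weights = {"pedestrian": 0.8, "animal": 0.2}
rural_weights = {"pedestrian": 0.3, "animal": 0.7}

urban_choice = select_model(candidates, urban_weights)
rural_choice = select_model(candidates, rural_weights)
print("Urban context:", urban_choice)
print("Rural context:", rural_choice)
```

With these values the preferred model differs between the two contexts, which is the effect described in the example. In practice the weights would themselves need to be justified from the safety requirements.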