Once verification activities have been undertaken, the ML verification evidence should be collated and reported in terms that are meaningful to the safety engineer with respect to the ML safety requirements and the operating environment. The verification evidence shall be comprehensive and shall clearly demonstrate coverage of the dimensions of variability, and combinations thereof, that are relevant to the ML safety requirements.
Examples of verification evidence from testing:
DeepXplore [51] and DeepTest [63] provide evidence that can be thought of as analogous to software coverage measures for deep neural networks. These techniques consider the firing of neurons in a network when it is exposed to the data samples in the training set. While this is not strictly analogous to traditional coverage measures, it allows the identification of those input samples which are ‘corner cases’ with respect to the training data. A sketch of such a coverage measure is given below.
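The following Python sketch illustrates the idea of neuron coverage; it is not the DeepXplore or DeepTest implementation, and the function name, the activation threshold, and the reliance on PyTorch forward hooks over ReLU layers are assumptions made for this example.

```python
import torch
import torch.nn as nn

def neuron_coverage(model: nn.Module, inputs: torch.Tensor,
                    threshold: float = 0.25) -> float:
    """Fraction of neurons whose (min-max scaled) activation exceeds
    `threshold` for at least one input (hypothetical sketch in the
    spirit of DeepXplore's coverage measure)."""
    activations = {}

    def hook(name):
        def fn(_module, _inp, out):
            activations[name] = out.detach()
        return fn

    # Record the output of every ReLU layer during a forward pass.
    handles = [m.register_forward_hook(hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()

    covered, total = 0, 0
    for out in activations.values():
        flat = out.flatten(start_dim=1)                 # (batch, neurons)
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        scaled = (flat - lo) / (hi - lo + 1e-8)         # per-input scaling
        covered += (scaled > threshold).any(dim=0).sum().item()
        total += flat.shape[1]
    return covered / max(total, 1)

# Illustrative usage with a toy model and random inputs.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4), nn.ReLU())
print(neuron_coverage(model, torch.randn(32, 8)))
```

Inputs that leave many neurons below the threshold, or that activate neurons no other sample does, are candidate corner cases.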
DeepRoad [68] utilises Generative Adversarial Network (GAN) based techniques to synthesise realistic and diverse driving scenarios (for example, the same road scene under rain or snow) in order to find inconsistencies in autonomous driving systems. The evidence should enumerate the scenarios examined and the results of the model when presented with these samples, as well as the ground truth labels. A metamorphic consistency check in the style of DeepRoad is sketched below.
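The following sketch shows the metamorphic relation that such evidence rests on: the same scene under different synthesised weather should yield (approximately) the same steering output. The `transform` argument stands in for a GAN-based weather synthesiser, which is assumed rather than implemented here; the tolerance value is likewise illustrative.

```python
import torch
import torch.nn as nn

def metamorphic_inconsistencies(model: nn.Module,
                                scenes: torch.Tensor,
                                transform,
                                tolerance: float = 0.1) -> torch.Tensor:
    """Return indices of scenes where the steering prediction on the
    weather-transformed scene diverges from the prediction on the
    original scene by more than `tolerance` (hypothetical sketch)."""
    model.eval()
    with torch.no_grad():
        original = model(scenes)              # predicted steering angles
        altered = model(transform(scenes))    # same scenes, new 'weather'
    divergence = (original - altered).abs().squeeze(-1)
    return (divergence > tolerance).nonzero(as_tuple=True)[0]
```

Each returned index, together with the original and transformed scenes and the two predictions, is a concrete item of verification evidence.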
Examples of verification evidence from formal verification:
Local robustness property [30]: given a deep neural network and a data sample, local robustness requires that, within a region of the input space around that sample, the output of the function defined by the neural network remains unaltered. The evidence is then the set of samples verified and the dimensions of the region for which robustness is proved.
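One common formalisation of this property is sketched below; the choice of the infinity norm and the radius ε are parameters of the analysis rather than fixed by [30]. For a classifier f, a sample x and a radius ε:

\[ \forall x' \,.\; \lVert x' - x \rVert_\infty \le \epsilon \;\Longrightarrow\; f(x') = f(x) \]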
DeepSafe [24] identifies ‘safe regions’ in the input space for classification problems. The input space is first partitioned into regions that are likely to share the same label. These regions are then checked for targeted robustness (i.e. no point in the region can be mapped to a specific incorrect label). More concretely, when considering an unmanned aircraft control system, DeepSafe was able to identify regions of the system’s input space for which a ‘weak left’ turn advisory would always be returned.
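Targeted robustness can be stated in a similar form; this is a sketch, with R denoting a safe region identified by DeepSafe and ℓ_t the specific incorrect label (e.g. an advisory other than ‘weak left’):

\[ \forall x' \in R \,.\; f(x') \ne \ell_t \]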