The ML data validation activity shall check that the three generated data sets are sufficient to meet the ML data requirements. The results of the data validation activity shall be explicitly documented ([S]). Data validation shall consider the relevance, completeness, balance, and accuracy of the data sets.
Discrepancies identified between the data generated and the ML data requirements shall be justified. These justifications shall be captured as part of the data validation results ([S]).
Both financial and practical concerns can lead to data sets that are not ideal and, in such cases, a clear rationale shall be provided. For example, a young child crossing in front of a fast-moving car may be a safety concern but gathering data for such events is not practicable.
The JAAD open dataset [56] is used as development data for an ML component used to detect pedestrians. Gathering and processing data for road crossings is expensive, and substantial effort has been invested in generating the JAAD dataset. The labelling of pedestrians and the range of poses observed are extensive and clearly relevant for a perception pipeline concerned with the identification of pedestrians. The range of crossing types observed is limited, however, and a justification may be required as to why the dataset is relevant for the intended deployment.
Validation of data relevance shall consider the gap between the samples obtained and the real-world environment in which the system is to be deployed. Validation shall consider each of the sub‐activities undertaken in data generation and provide a clear rationale for their use.
Any simulation used for data augmentation is necessarily a simplification of the real world, with assumptions underpinning the models used. Validating relevance therefore requires that the gaps between the simulation and the real world be identified, together with a demonstration that these gaps are not material to the construction of a safe system.
Validation should demonstrate that context‐specific features defined in the ML safety requirements are present in the collected datasets. For example, for a pedestrian detection system intended for deployment on European roads, the images collected should include road furniture of the types that would be found in the anticipated countries of deployment.
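A presence check of this kind can be sketched in a few lines. The sketch below assumes that each image carries a set of feature tags and that the required features have been written down as a set. The tag names and the helper `absent_features` are hypothetical and are used only for illustration.

```python
# Hypothetical context-specific features taken from the requirements (assumed names).
REQUIRED_FEATURES = {"zebra crossing", "traffic light", "bollard"}

# Hypothetical per-image tag sets, e.g. produced during labelling.
image_tags = [
    {"zebra crossing", "pedestrian"},
    {"traffic light", "pedestrian", "cyclist"},
]


def absent_features(required, tagged_samples):
    """Return the required features that appear in no sample, sorted by name."""
    present = set().union(*tagged_samples) if tagged_samples else set()
    return sorted(required - present)


absent = absent_features(REQUIRED_FEATURES, image_tags)
print(absent)
```

Any feature reported as absent would need either further data collection or a justification recorded in the validation results.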
Where data gathered in US hospitals is used for a UK prognosis system, the validation results should state how local demographics, policies and equipment vary between the two countries and what impact this variance has on data validity.
When data is collected using controlled trials (e.g. for medical imaging) a decision may be made to collect samples using a machine set up away from the hospital using non‐medical staff. The samples may only be considered relevant if an argument can be made that the environmental conditions do not impact the samples obtained and that the experience of the staff has no effect on the samples collected.
Validation of data completeness shall demonstrate that the collected data sufficiently covers all the dimensions of variation stated in the ML safety requirements. Given the combinatorial nature of input features, validation shall seek to systematically identify areas that are not covered.
As the number of dimensions of variability and the granularity with which these dimensions are encoded increase, the space that must be validated grows combinatorially.
For continuous variables, the number of possible values is infinite. One possible approach is to use quantisation to map the continuous variables to a discrete space which may be more readily assessed. Where quantisation is employed it should be accompanied by an argument concerning the levels used.
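As a minimal sketch of quantisation, the example below maps a continuous occlusion fraction onto discrete levels using fixed bin edges. The edges in `OCCLUSION_EDGES` and the helper `quantise` are assumptions chosen for illustration; they are not levels prescribed by this guidance, and any real choice of levels would need its own supporting argument.

```python
from bisect import bisect_right

# Hypothetical bin edges for partial occlusion, as a fraction of the object area.
# These values are illustrative assumptions only.
OCCLUSION_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]


def quantise(value, edges):
    """Return the index of the bin that `value` falls into.

    Bin 0 holds values below edges[0]; bin len(edges) holds values
    at or above edges[-1]. `edges` must be sorted in ascending order.
    """
    return bisect_right(edges, value)


samples = [0.0, 0.05, 0.1, 0.34, 0.69, 0.7]
levels = [quantise(s, OCCLUSION_EDGES) for s in samples]
print(levels)  # [0, 0, 1, 3, 6, 7]
```

Once every sample has a discrete level, coverage can be assessed by counting samples per level rather than reasoning over an infinite range of values.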
Consider a system that classifies road signs into 43 separate classes. The dimensions of variability are weather, time of day, and levels of partial occlusion up to 70%.
Let us assume that we have categorised each dimension as:
Validation may seek to show that there are samples for each of the 43 × 7 × 7 × 8 = 16856 possible combinations. A systematic validation process will identify which combinations are missing from the datasets (e.g. no samples containing a 40mph sign in light rain with 50% occlusion early in the morning). Although for most practical systems full completeness is not achievable, this process should provide evidence of the areas that are incomplete and of why this is not problematic for assuring the resultant system.
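The systematic search for uncovered combinations can be sketched as an enumeration over the Cartesian product of the dimension levels. The example below uses a deliberately small, hypothetical set of levels so that the result is easy to inspect. The level names, the helper `missing_combinations`, and the assumption that each sample carries one metadata tuple are all illustrative.

```python
from itertools import product

# Size of the full space from the worked example in the text.
FULL_SPACE = 43 * 7 * 7 * 8  # 16856

# Hypothetical, much smaller set of levels, used only to keep the sketch readable.
SIGN_CLASSES = ["30mph", "40mph", "stop"]
WEATHER = ["clear", "light rain"]
TIME_OF_DAY = ["early morning", "midday"]
OCCLUSION = ["0%", "50%"]
DIMENSIONS = (SIGN_CLASSES, WEATHER, TIME_OF_DAY, OCCLUSION)


def missing_combinations(observed, *dimensions):
    """Return every combination of levels with no sample, in product order."""
    seen = set(observed)
    return [combo for combo in product(*dimensions) if combo not in seen]


# Simulated sample metadata: every combination is observed except one.
all_combos = list(product(*DIMENSIONS))
gap = ("40mph", "light rain", "early morning", "50%")
observed = [combo for combo in all_combos if combo != gap]

missing = missing_combinations(observed, *DIMENSIONS)
coverage = 1 - len(missing) / len(all_combos)
print(missing)
print(f"{coverage:.1%} of {len(all_combos)} combinations covered")
```

The list of missing combinations is the evidence that matters here: each entry needs either additional data or a recorded argument for why its absence does not undermine assurance of the system.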
Validation of data balance shall consider the distribution of samples in the data set. It is easiest to consider balance from a supervised classification perspective where the number of samples associated with each class is a key consideration.
At the class level, assessing balance may be a simple case of counting the number of samples in each class. This approach becomes more complex, however, when considering the dimensions of variation, where specific combinations are relatively rare. More generally, data validation shall include statements regarding class balance and feature balance for supervised learning tasks.
Certain classes may naturally be less common and, whilst techniques such as data augmentation may help, it may be difficult, or even impossible, to obtain a truly balanced set of classes. In such cases, the imbalance shall be noted and a justification provided as part of the validation results to support the use of imbalanced data in the operational context.
Using the previous example, we may count the number of samples at each level of occlusion to ensure that each level is appropriately represented in the data sets.
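A count of this kind can be sketched with the standard library. The example assumes that each sample has already been assigned a discrete occlusion level. The level names, the sample data, and the `min_share` threshold are hypothetical; an appropriate threshold would have to be argued for in the validation results.

```python
from collections import Counter

# Hypothetical occlusion level recorded for each sample (illustrative data).
occlusion_levels = ["0%", "0%", "0%", "0%", "10%", "10%", "50%", "0%"]

# Hypothetical set of levels that the requirements expect to be represented.
EXPECTED_LEVELS = ["0%", "10%", "50%", "70%"]


def balance_report(labels, expected_levels):
    """Map each expected level to (sample count, share of all samples)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {lvl: (counts[lvl], counts[lvl] / total) for lvl in expected_levels}


def under_represented(labels, expected_levels, min_share):
    """Return the expected levels whose share of samples is below min_share."""
    report = balance_report(labels, expected_levels)
    return [lvl for lvl, (_, share) in report.items() if share < min_share]


report = balance_report(occlusion_levels, EXPECTED_LEVELS)
flagged = under_represented(occlusion_levels, EXPECTED_LEVELS, min_share=0.2)
print(report)
print(flagged)  # ['50%', '70%']
```

Iterating over the expected levels, rather than only the observed ones, means that a level with no samples at all is reported with a count of zero instead of being silently omitted.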
Validation of data accuracy shall consider the extent to which the data samples, and the metadata added to the set during preprocessing (e.g. labels), are representations of the ground truth associated with the samples. Evidence supporting the accuracy of data may be gathered through a combination of the following:
Where existing data sets are re‐used (e.g. the JAAD pedestrian data set [56]), documentation concerning the process may be available. Even under these conditions, additional validation tasks may be required to ensure that the labels are sufficient for the context into which the model is to be deployed.