The ML data requirements shall specify the characteristics that the data collected must have in order to ensure that a model meeting the ML safety requirements may be created. ML data requirements shall include consideration of the relevance, completeness, accuracy and balance of the data [5]. These requirements shall explicitly state the assumptions made with respect to the operating environment and the data features required to encode the domain.
ML data requirements will often focus on specifying data that is necessary to ensure the robustness of the model in the context of the operational domain. This should relate to the dimensions of variation anticipated in the operational domain as enumerated in the ML safety requirements.
ML data requirements relating to relevance shall specify the extent to which the data must match the intended operating domain into which the model is to be deployed.
For an ML component used for object detection on a vehicle, the following may be defined as an ML data requirement for relevance: “Each data sample shall assume sensor positioning which is representative of that to be used on the vehicle”. This requirement is defined to ensure that images that provide a very low or very high viewpoint of the road (such as an aerial view) are not used in development.
For an ML component used for medical diagnosis based on X-ray images for use in UK hospitals, the following may be defined as an ML data requirement for relevance: “Each data sample shall be representative of those images gathered for machines of type A, B, C which are in use in UK hospitals”. This requirement ensures that images containing artefacts introduced by the image processing of machines used outside the UK are not used in development.
ML data requirements relating to completeness shall specify the extent to which the development data must be complete with respect to a set of measurable dimensions of the operating domain. This can be done through reference to the anticipated dimensions of variation stated in the ML safety requirements ([H]) or defined by the operating context ([B]).
The operational domain for an autonomous vehicle indicates that the vehicle is to operate at all times of day and that the ML component should be robust to changing light levels. An ML data requirement for completeness may state: “Data samples should be gathered at all times of day and under the following light conditions: sunlight, cloud, rural with headlights and urban street lighting”.
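A completeness requirement of this kind can be checked mechanically when each data sample carries metadata. The following is a minimal sketch only: the field names `time_of_day` and `light`, the four time-of-day bins and the sample records are illustrative assumptions, not part of the requirement.

```python
# Light conditions named in the example requirement.
REQUIRED_LIGHT = {"sunlight", "cloud", "rural_headlights", "urban_street_lighting"}
# Assumed discretisation of "all times of day"; the bins are illustrative.
REQUIRED_TIMES = {"morning", "afternoon", "evening", "night"}


def missing_values(samples, field, required):
    """Return the required values of `field` that never occur in `samples`."""
    seen = {sample[field] for sample in samples}
    return required - seen


# Hypothetical sample metadata.
samples = [
    {"time_of_day": "morning", "light": "sunlight"},
    {"time_of_day": "afternoon", "light": "cloud"},
    {"time_of_day": "night", "light": "urban_street_lighting"},
]

missing_light = missing_values(samples, "light", REQUIRED_LIGHT)
missing_times = missing_values(samples, "time_of_day", REQUIRED_TIMES)
```

Any non-empty result identifies a condition for which further data would need to be gathered before the requirement could be considered satisfied.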
When building a model to determine the life expectancy of patients suffering from liver failure, a MELD score, which is calculated from four lab tests on the patient, is commonly used [19]. Normal ranges for each of these results are known from historic data. A completeness requirement may state that: “Data samples should as a minimum include patients with Bilirubin levels across the range of [5.13, 32.49]”.
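Coverage of a numeric range such as this one can be approximated by dividing the range into bins and checking that no bin is empty. The sketch below assumes equal-width bins and uses made-up values; the number of bins is a judgement that the requirement itself would need to state.

```python
def uncovered_bins(values, low, high, n_bins):
    """Split [low, high] into n_bins equal-width bins and return the
    indices of the bins that contain no value."""
    width = (high - low) / n_bins
    covered = set()
    for v in values:
        if low <= v <= high:
            # A value equal to the upper bound is placed in the last bin.
            idx = min(int((v - low) / width), n_bins - 1)
            covered.add(idx)
    return sorted(set(range(n_bins)) - covered)


# Range taken from the example requirement above.
LOW, HIGH = 5.13, 32.49
bilirubin = [5.5, 9.0, 12.3, 30.1, 32.49]  # made-up illustrative values
gaps = uncovered_bins(bilirubin, LOW, HIGH, n_bins=4)
```

Here `gaps` lists the bins with no samples, which indicates the parts of the range that the data set does not yet cover.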
ML data requirements shall include requirements that specify the required accuracy of the development data.
Requirements may relate to the labelling of data samples. Label quality has a significant impact on the reliability of the risk acceptance criteria. Deciding on these criteria involves subjective judgement and is prone to systematic and random errors [13]. In a study reported by Krause et al., the same ML model showed a 30% relative reduction in errors after switching from labels established by a majority vote of three retinal specialists to labels established by adjudication from the same specialists [40].
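The difference between the two labelling schemes can be made concrete. The sketch below computes a majority-vote label for each sample and flags the samples on which graders disagree, which are the natural candidates for adjudication. It is an illustration only and not the procedure used in the cited study; the sample identifiers and grades are hypothetical.

```python
from collections import Counter


def majority_label(grades):
    """Return the most common grade and whether the graders were unanimous.

    With no clear majority, Counter.most_common returns the grade that
    was encountered first among those tied.
    """
    counts = Counter(grades)
    label, votes = counts.most_common(1)[0]
    return label, votes == len(grades)


# Hypothetical grades from three graders per sample.
grades_per_sample = {
    "img_001": ["mild", "mild", "mild"],
    "img_002": ["mild", "moderate", "mild"],
    "img_003": ["none", "none", "none"],
}

labels = {}
needs_adjudication = []
for sample_id, grades in grades_per_sample.items():
    label, unanimous = majority_label(grades)
    labels[sample_id] = label
    if not unanimous:
        needs_adjudication.append(sample_id)
```

An accuracy requirement could then state, for example, that every sample in `needs_adjudication` must be resolved by discussion among the graders rather than by the vote alone.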
Consider an ML safety requirement that all pedestrians should be identified within 50 cm of their true position. Given that pedestrians are not point masses but are instead represented as coloured pixels in the image, an accuracy requirement must clearly specify the required position of the label, including the positioning of labels for partially occluded objects. An example accuracy requirement may state that: “When labelling data samples, the position of all pedestrians shall be recorded as their extremity closest to the roadway”.
ML data requirements relating to balance shall specify the required distribution of samples in the data sets.
Consider a classifier that is designed to identify one of n classes. A data set that is balanced with respect to the classes would contain the same number of samples for each class. More generally, however, balance may be considered with respect to certain features of interest (e.g. environmental conditions, gender, race, etc.). This means that a data set that is balanced with respect to the classes may still be biased with respect to critical features of the data.
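The distinction between balance over classes and balance over features can be illustrated with a small example. The sketch below uses made-up data with a hypothetical `light` feature: the two classes are equally represented, yet the data set is heavily skewed towards daytime samples.

```python
from collections import Counter


def proportions(samples, key):
    """Return the fraction of samples taking each value of `key`."""
    counts = Counter(sample[key] for sample in samples)
    total = len(samples)
    return {value: n / total for value, n in counts.items()}


# Made-up data: two classes, equally represented, but mostly daytime.
samples = (
    [{"label": "pedestrian", "light": "day"}] * 45
    + [{"label": "pedestrian", "light": "night"}] * 5
    + [{"label": "cyclist", "light": "day"}] * 45
    + [{"label": "cyclist", "light": "night"}] * 5
)

by_class = proportions(samples, "label")
by_light = proportions(samples, "light")
```

Checking `by_class` alone would suggest a balanced data set; `by_light` shows that only a small fraction of the samples were gathered at night. A balance requirement therefore needs to name the features over which the distribution is to be controlled.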
DeepMind’s ML model for detecting acute kidney failure is reported to achieve very high accuracy and predictive power. However, analysis [61] shows that the data used to train the model was overwhelmingly from male patients (93.6%). In this case, an ML data requirement for balance in the gender of the data sources should have been explicitly specified, since this feature is relevant to the operating context of the model (which will be used for both male and female patients). Similarly, the data was collected from a set of individuals that lacked other forms of diversity. This could lead to the results in operation falling far short of those promised for the affected groups of patients.