Checklist: Anonymization for ADAS and Autonomous Vehicles
All you need to know to anonymize ADAS datasets and comply with data protection laws
Several technologies enable vehicles to navigate autonomously without human assistance, including GPS, LiDAR, other sensors and cameras.
One technology that has been steadily gaining attention is automated object recognition in video data. Its development requires the collection of a vast volume of imagery, taken from all manner of driving scenarios, in order to train AI systems to accurately recognize pedestrians, traffic signs, other vehicles, and so on.
The collection of such data is essential to the development of autonomous vehicles, but it also poses a threat to individuals' privacy because it accumulates a large volume of personal data.
As we described in this article, the general public is concerned about the privacy impact of the widespread use of autonomous vehicles. According to this research, 54% of participants would spend more than five minutes using an online system to opt out of identifiable data collection. A variety of measures focused on protecting personal information should therefore be applied to increase the general acceptance of autonomous vehicles.
Why anonymization for ADAS and Autonomous Vehicles
First of all, we need to differentiate between two types of personal data. The first type is primary data, which is recorded and collected while you use the car yourself. An example is the kind of music you listen to, which enables personalized recommendations like those you get on Spotify or Netflix.
The second type is secondary data, which is recorded “indirectly”, such as footage of a pedestrian walking or a cyclist riding past.
Users normally deal with the first type of data, the use of which is usually covered by terms and conditions. However, people may be far less comfortable with being recorded without knowing how the data will be used and without having given consent.
Figure 2: A Fully Autonomous Driving Journey ©Waymo Inc.
The introduction of the GDPR in Europe has created a regulatory framework to protect users from abuse or misuse of personal data without their consent.
Art. 7 of the GDPR requires the data controller to be able to demonstrate that the data subject has consented to the processing of his or her personal data. The data subject also has the right to withdraw consent at any time.
Infringements of the GDPR can be costly in more than one way:
- Fine (Art. 83): up to €20 million or, in the case of a company, up to 4% of the total worldwide annual turnover of the preceding financial year, whichever is higher.
- Compensation (Art. 82): assessed case by case for any person who has suffered damage as a result of an infringement; claims are brought before the competent national courts.
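The upper bound of the fine for a company can be worked out with simple arithmetic. The following is a minimal sketch, assuming the Art. 83 rule that the higher of the two amounts applies; the function name and the example turnover figures are hypothetical and chosen for illustration only.

```python
# Hypothetical helper: upper bound of a GDPR fine for a company,
# assuming the higher of EUR 20 million and 4% of turnover applies.
FIXED_CAP_EUR = 20_000_000
TURNOVER_PERCENT = 4


def max_gdpr_fine_eur(annual_turnover_eur: float) -> float:
    """Return the maximum possible fine in EUR for a given worldwide annual turnover."""
    turnover_based_cap = annual_turnover_eur * TURNOVER_PERCENT / 100
    return float(max(FIXED_CAP_EUR, turnover_based_cap))


if __name__ == "__main__":
    # Illustrative turnovers only; not real company figures.
    for turnover in (100_000_000, 2_000_000_000):
        print(f"turnover {turnover:>13,} EUR -> cap {max_gdpr_fine_eur(turnover):,.0f} EUR")
```

For a company with a turnover below €500 million, the fixed €20 million cap is the relevant one; above that threshold, the 4% rule dominates.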
In any case, requesting consent from hundreds of thousands of pedestrians would be cumbersome, time-consuming and costly.
Pseudonymization and Anonymization
Fortunately, the GDPR leaves room for an alternative to collecting consent: anonymization, since anonymized data falls outside its scope (Recital 26). However, we must not confuse anonymization with pseudonymization.
Pseudonymization is defined in the GDPR (Art. 4(5)) and removes direct identifiers (such as a name or social security number). Instead of a driver’s name, a pseudonymous identification number or a vehicle identifier is used. This has been common practice in the ADAS community for many years.
However, this is not enough to comply with data protection laws. Pseudonymized data can always be re-identified by the data controller, whereas anonymized data is defined as “data rendered anonymous in such a way that the data subject is not or no longer identifiable” by anybody (including the data controller, i.e. the ADAS company).
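The difference can be made concrete with a small sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization method, which is one common choice and not something prescribed by the GDPR or by this article; the key, the record fields and the names are all hypothetical. The point it illustrates is that whoever holds the key can link a pseudonym back to a person, so the data remains personal data.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data controller.
SECRET_KEY = b"held-only-by-the-data-controller"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a stable keyed pseudonym."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def reidentify(driver_id: str, candidate_names: list) -> "str | None":
    """Show that the key holder can map a pseudonym back to a known person."""
    for name in candidate_names:
        if pseudonymize(name) == driver_id:
            return name
    return None


# Made-up example records.
records = [
    {"driver": "Alice Example", "avg_speed_kmh": 52},
    {"driver": "Bob Example", "avg_speed_kmh": 61},
    {"driver": "Alice Example", "avg_speed_kmh": 48},
]

# The direct identifier is dropped, but trips by the same driver stay linkable.
pseudonymized = [
    {"driver_id": pseudonymize(r["driver"]), "avg_speed_kmh": r["avg_speed_kmh"]}
    for r in records
]

if __name__ == "__main__":
    for row in pseudonymized:
        print(row)
    print(reidentify(pseudonymized[1]["driver_id"], ["Alice Example", "Bob Example"]))
```

Anonymization, by contrast, would have to make the `reidentify` step impossible for everyone, including the party that produced the dataset.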
Of course, you don’t want to remove faces and license plates from the imagery altogether, since these objects are fundamental for training your deep learning models. At the same time, you have to make sure that your engineers and other stakeholders see only the anonymized datasets. Some of the most common use cases we have encountered are:
- Field operation tests (FOTs)
- HD maps
- Validation and training
Not all collected data is technically necessary to enable connected and autonomous driving. One example is data that the individual driver or user enters for infotainment purposes or comfort settings, which should be limited under the principle of data minimization.
At the same time, the processing of all this data is often carried out by multiple other entities, known as data processors (e.g. data labelling companies). Data controllers are expected to meet the data protection requirements for the collection and use of personal data by minimizing the amount of shared data and anonymizing all personal information.
What personal data to anonymize from ADAS data
For many years, the ADAS community has stated that anonymization is needed to protect the privacy of participants (e.g. when publishing results or sharing data). This is true, but it is very difficult to achieve as long as the original dataset remains accessible, because the process is then reversible and individuals can be re-identified.
AV collected data can be classified under three main categories:
- Owner and passenger information like comfort, driving and entertainment settings.
- Location data such as vehicle GPS location, speed, real-time traffic, etc.
- Sensor data, including cameras or dash cams - front, rear and side cameras - radar, thermal imaging devices, and light detection and ranging (LiDAR) devices.
As we mentioned earlier, anonymization refers to any technique that irreversibly distorts data in such a way that personal data cannot be reconstructed.
Owner and passenger information does not constitute a regulatory issue (since compliance is covered under the terms and conditions between the user and the company) unless it is shared with third parties (e.g. data processors).
Location data is still an “unsolved problem”. Researchers have shown that only a few location points are enough to re-identify an individual with 95% accuracy. In general, current anonymization techniques have not been effective against re-identification. Differential privacy provides a promising privacy definition for location data, but the research is not yet mature enough for application at scale.
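To give an idea of what a differential-privacy-style approach to location data looks like, the sketch below perturbs a single GPS point with the planar Laplace mechanism, the building block of geo-indistinguishability (a location-oriented variant of differential privacy). This technique is named here as an illustration and is not taken from the article; the function name, the `epsilon_per_m` parameter and the coordinates are assumptions. It protects one point at a time and does not by itself solve the trajectory re-identification problem described above.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0


def perturb_location(lat_deg: float, lon_deg: float,
                     epsilon_per_m: float, rng: random.Random) -> tuple:
    """Add planar Laplace noise to one point (hypothetical helper).

    The angle is uniform and the radius follows Gamma(shape=2, scale=1/epsilon),
    so the mean displacement is 2 / epsilon_per_m metres.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    radius_m = rng.gammavariate(2.0, 1.0 / epsilon_per_m)

    # Convert the metric offset to degrees (small-offset approximation).
    dlat_rad = radius_m * math.sin(theta) / EARTH_RADIUS_M
    dlon_rad = radius_m * math.cos(theta) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    )
    return lat_deg + math.degrees(dlat_rad), lon_deg + math.degrees(dlon_rad)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    original = (48.3069, 14.2858)  # illustrative coordinates
    noisy = perturb_location(*original, epsilon_per_m=0.01, rng=rng)
    print("original:", original)
    print("noisy:   ", noisy)
```

A smaller `epsilon_per_m` gives stronger privacy but larger displacement, which is exactly the utility trade-off that makes this hard to apply to driving data at scale.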
Lastly, sensor data (in particular imagery) has gained great attention from companies and regulators. Anonymization techniques such as blurring have been widely adopted due to their technological maturity and effectiveness in protecting personal data.
Among all collected objects, faces and bodies are the most fundamental and highly visible elements of our identity. Hence, they fall under the definition of personal data. Similarly, license plate numbers can be used to trace the identity of the subject.
How to anonymize personal data from AV data
For most use cases, blurring offers the best trade-off between performance, anonymization and reduced distortion, and it has emerged as the de facto standard anonymization method. In fact, companies like Google, Microsoft and TomTom are using it to protect personal data.
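As a rough illustration of what blurring a detected object involves, here is a minimal sketch of a box blur applied to one region of a grayscale image, written in plain Python so that it runs without extra libraries. It assumes the bounding box has already been produced by some face or license plate detector, which is not shown; the function name, the box format and the toy image are hypothetical, and this is not a description of any vendor's implementation.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major, values 0..255
Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 (x1 and y1 are exclusive)


def blur_region(image: Image, box: Box, radius: int = 2) -> Image:
    """Return a copy of the image with a box blur applied inside the given box."""
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched

    for y in range(max(0, y0), min(height, y1)):
        for x in range(max(0, x0), min(width, x1)):
            # Average all pixels in a (2*radius+1)^2 window, clipped at the borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) // len(window)
    return out


if __name__ == "__main__":
    # Toy 8x8 checkerboard standing in for a high-contrast face or plate.
    toy = [[255 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
    detected_box = (2, 2, 6, 6)  # would come from a detector in practice
    for row in blur_region(toy, detected_box):
        print(" ".join(f"{value:3d}" for value in row))
```

Only the pixels inside the box change, which is why blurring keeps distortion low for the rest of the frame. In practice the blur strength has to be high enough that the original detail cannot be recovered.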
Currently, there are several approaches to image blurring. However, each of them has substantial bottlenecks:
| Approach | Advantages | Drawbacks |
|---|---|---|
| In-house manual blurring | The company has full control of the data. | Time-consuming and therefore costly due to high hourly rates. |
| Outsourced manual blurring in low-wage countries | The price per hour for manual blurring is significantly lower. | The GDPR (Chapter 5) restricts transfers of EU personal data to countries that do not offer an equivalent level of data protection. |
| In-house automated solutions | The company has full control of the data. The process is partially or fully automated, so manual work is not required. | Requires in-house machine learning expertise. ADAS companies usually have it, since it is at the core of their technology, but building an anonymization tool could divert resources from the core business. |
- Why: The introduction of the GDPR in Europe has created a regulatory framework to protect users from abuse or misuse of personal data without their consent. Infringements of the GDPR can be very costly.
- How: Automated blurring offers the best trade-off between performance, anonymization and reduced distortion, and it has emerged as the de facto standard anonymization method.
- What: Sensor data, including cameras or dash cams - front, rear and side cameras - radar, thermal imaging devices, and light detection and ranging (LiDAR) devices.
Celantur approach: Automated Blurring
Why Celantur is better than:
- In-house manual blurring: Our automated blurring solution is significantly cheaper and faster.
- Outsourced manual blurring in low-wage countries: We process imagery only in EU-based data centers. Additionally, we apply the highest privacy and data protection measures.
- In-house semi-automated/automated blurring: Our solution runs wherever you prefer (cloud, on-premise, API integration). By outsourcing this process, you don't need to divert the company's resources (human and financial) from its core business.