Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
AI systems are trained with a wide range of data. Often, this includes personal information. Distributed machine learning can improve data protection in AI development, as the data being used is processed on users’ devices rather than on a central server. However, it also creates more potential entry points for attackers. The German Plattform Lernende Systeme (Learning Systems Platform) provides an overview in its current issue of “AI at a glance”.
In distributed machine learning, each end device retrieves the current model and trains it locally with its own data set. Potentially personal data therefore does not need to be processed on a central server.
To update and improve the ML model, only the training results (called weights), not the actual data, are shared with other end devices. There are three technical approaches to distributed machine learning:
In split learning, the AI model is divided up: one part is trained on the end devices and the other part on the server.
In federated learning, a central server acts as an aggregation instance for weights.
In swarm learning, an AI model is trained on distributed devices without a central aggregation instance.
One potential application for split learning is image recognition for autonomous driving. The continuous improvement of a basic image recognition model could be distributed among many vehicles, each of which uses its own sensor data to refine the model. The vehicles then provide the locally trained parameters to the basic model on the central server for further training. In this way, the privacy-relevant route data is processed entirely locally.
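How a model can be trained partly on a device and partly on a server may be easier to see in a toy sketch. The following is only an illustration under stated assumptions, not the method described by Plattform Lernende Systeme: it uses a two-parameter model in which the device holds the front part (h = a * x) and the server holds the back part (y = c * h). The class names `Client` and `Server`, the learning rate, and the choice to let the server see the training target are all hypothetical simplifications.

```python
from typing import List, Tuple


class Server:
    """Holds the back part of the model: y = c * h (hypothetical toy model)."""

    def __init__(self, c: float = 0.5, lr: float = 0.05) -> None:
        self.c = c
        self.lr = lr

    def step(self, h: float, target: float) -> Tuple[float, float]:
        """Finish the forward pass, update c, return (loss, gradient w.r.t. h)."""
        y = self.c * h
        err = y - target
        grad_h = err * self.c          # uses the old c; sent back to the device
        self.c -= self.lr * err * h    # server-side parameter update
        return 0.5 * err * err, grad_h


class Client:
    """Holds the front part of the model: h = a * x. The raw input x stays here."""

    def __init__(self, a: float = 0.5, lr: float = 0.05) -> None:
        self.a = a
        self.lr = lr

    def forward(self, x: float) -> float:
        return self.a * x              # only this intermediate value leaves the device

    def backward(self, x: float, grad_h: float) -> None:
        self.a -= self.lr * grad_h * x  # chain rule: dL/da = dL/dh * x


def train(client: Client, server: Server,
          data: List[Tuple[float, float]], epochs: int) -> float:
    """Run split training and return the loss of the last step."""
    loss = 0.0
    for _ in range(epochs):
        for x, target in data:
            h = client.forward(x)
            loss, grad_h = server.step(h, target)
            client.backward(x, grad_h)
    return loss
```

In this sketch the server only ever receives the intermediate value `h` and the target, never the raw input `x`. Whether the target may also be shared depends on the application; that part is an assumption made here for brevity.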
Another example is the local training of AI models for smartphone autocompletion and correction. Through federated learning, only the weights of the model are shared with the central server. Texts written with the smartphone, which can give clues about life situations or even reveal company secrets, thus remain on the device.
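The exchange of weights in federated learning can be sketched in a few lines. This is a minimal illustration, assuming a tiny linear model y = w * x + b and a weighted average as the aggregation rule (commonly known as federated averaging); the function names, the learning rate, and the data are hypothetical and not taken from the overview.

```python
from typing import List, Tuple

Weights = List[float]                 # [w, b] of the toy model y = w * x + b
Dataset = List[Tuple[float, float]]   # private (x, y) pairs on one device


def local_update(weights: Weights, data: Dataset, lr: float = 0.1) -> Weights:
    """Train on the device: one pass of gradient descent over the private data."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def federated_average(updates: List[Weights], sizes: List[int]) -> Weights:
    """Server side: average the received weights, weighted by local data size."""
    total = sum(sizes)
    return [
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def federated_round(global_weights: Weights, devices: List[Dataset],
                    lr: float = 0.1) -> Weights:
    """One round: every device trains locally, the server sees only the weights."""
    updates = [local_update(list(global_weights), d, lr) for d in devices]
    sizes = [len(d) for d in devices]
    return federated_average(updates, sizes)
```

Calling `federated_round` repeatedly moves the shared model toward a fit of all devices' data, although the server function only ever receives weight lists and dataset sizes.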
Swarm learning could help diagnose diseases without privacy concerns. The diagnostic model is distributed on the blockchain to various clinics licensed through health insurance companies. The parameters of the central model are retrieved by the clinics, locally merged into an overall model, and trained with local health data. The parameters of the updated model are then synchronized back to the blockchain. Medically sensitive, personal information is not transferred.
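The clinic scenario can be approximated in a small sketch. It assumes a plain Python list as a stand-in for the blockchain and a one-parameter "model" that simply estimates an average measurement; the names `merge`, `train_locally` and `swarm_round` are hypothetical. The point is only that every participant merges the published parameters itself, so no central aggregation instance is needed.

```python
from typing import Dict, List

Params = List[float]
# Stand-in for the blockchain: one entry per round, mapping peer name -> parameters.
Ledger = List[Dict[str, Params]]


def merge(entries: List[Params]) -> Params:
    """Combine the published parameters of all peers by element-wise averaging."""
    return [sum(column) / len(column) for column in zip(*entries)]


def train_locally(params: Params, records: List[float], lr: float = 0.05) -> Params:
    """Toy local training: nudge a single estimate toward each local record."""
    (estimate,) = params
    for value in records:
        estimate -= lr * (estimate - value)
    return [estimate]


def swarm_round(local_data: Dict[str, List[float]], ledger: Ledger) -> None:
    """One round: each peer reads the ledger, merges, trains, and publishes."""
    latest = list(ledger[-1].values())
    new_entry: Dict[str, Params] = {}
    for peer, records in local_data.items():
        merged = merge(latest)                       # done by the peer itself
        new_entry[peer] = train_locally(merged, records)
    ledger.append(new_entry)                         # only parameters are published
```

Only parameter lists are appended to the ledger; the local records of each peer are never written to it.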
“Distributed machine learning opens up new possibilities for effective and scalable use of data without having to share it. This makes many useful applications with sensitive data possible in the first place,” says Ahmad-Reza Sadeghi, professor of computer science at Darmstadt University of Technology and member of the IT Security and Privacy working group of the Plattform Lernende Systeme.
But there are also challenges: Distributing data and training processes across many endpoints creates new gateways for attackers. In addition, distributed machine learning requires an Internet connection for parameter exchange, which can lead to instabilities.
Model updates can also allow conclusions to be drawn about personal data, and in some cases the data of individual persons can be identified in the training dataset. The following graphic provides an overview of the advantages and disadvantages.