Data Science

Meet the Researcher: Antti Honkela, Applying Machine Learning to Preserve Private Data

‘Meet the Researcher’ is a series in which we spotlight researchers in academia who use NVIDIA technologies to accelerate their work. 

This month we spotlight Antti Honkela, Associate Professor in Computer Science at University of Helsinki in Finland.

Honkela is the Coordinating Professor of the Research Program in Privacy-preserving and Secure AI at the Finnish Center for Artificial Intelligence (FCAI). In addition, he serves as a privacy and anonymity expert in the steering group of Findata, the recently established Finnish Health and Social Data Permit Authority, and has given expert statements to the Finnish Parliament on legislation related to health data privacy.

Figure1. Illustration of noise-aware differentially private Bayesian learning in learning a Gaussian process (blue curve with shaded blue region denoting posterior confidence region) to approximate an underlying function (black curve) from 1,024 noisy observations from a recent pre-print. Different panels show how results become more accurate, and the confidence region shrinks correspondingly under decreasing levels of privacy (increasing ε).

What are your research areas of focus?

Most of my group focuses on developing machine learning and probabilistic inference methods under differential privacy. This provides a strong guarantee that the results cannot be used to violate the privacy of data subjects. I also supervise two students working on applying probabilistic models to analyze genetic data.

What motivated you to pursue this research area?

I have been interested in math and computers for a while. After one year at university, I was given the opportunity to work as a research assistant for Dr. Harri Valpola, who is now the CEO and co-founder of Curious AI. It was during this time that I became hooked on Bayesian machine learning.

Bioinformatics came into the picture after I received my PhD when I struggled to find an application for my machine learning work, which was not apparent at the time. Thanks to an opportunity at a NeurIPS workshop, I met Professor Eric Mjolsness; he told me that some of the models I had been developing might be perfect for modeling gene regulation.

After a few years of working on bioinformatics, I moved back into machine learning to work on differential privacy. This has been an excellent opportunity to link my research, my long-term interest in digital human rights, and my solid theoretical background in mathematics to help solve what I believe will be a significant bottleneck for machine learning for health.

Tell us about a few of your current research projects.

One major project in my group is the work led by Dr. Antti Koskela on using numerical methods for accurate privacy accounting for differential privacy. Differential privacy allows the deriving of an upper bound on so-called privacy loss of data subjects when their data is used. However, the loss increases with each additional access to the data, and it is easy to derive very loose upper bounds on the total loss. Still, these provide a very pessimistic view of the actual privacy loss. Deriving accurate bounds for complex algorithms such as training a neural network with differentially private stochastic gradient descent has been a major challenge, but our work provides an efficient numerical solution with provable error bounds.

Another major initiative is developing tools for differentially private probabilistic programming, which allows the user to specify the structure of a probabilistic model. At the same time the system will automatically derive an algorithm for learning the model from data. Such models allow creating anonymized twins of sensitive data sets more efficiently by easily incorporating prior knowledge. This work is based on a very close collaboration with researchers from the group of Professor Samuel Kaski at Aalto University, and led by Joonas Jälkö and Lukas Prediger.

What problems or challenges does your research address?

We want to develop technologies that allow using sensitive personal data such as health data for things like precision healthcare with guarantees that data subject privacy is maintained. I believe these will be essential for achieving the desired AI revolution in healthcare in a societally sustainable way.

What technological breakthroughs are you most proud of?

From our recent work, I am excited about noise-aware differentially private Bayesian inference we have recently developed for generalized linear models such as logistic regression (led by Dr. Tejas Kulkarni from Aalto University) as well as Gaussian processes. These methods beautifully bring together two important technologies: differential privacy for strong privacy protection, and Bayesian inference for quantifying the uncertainty of predictions and inferences. These are a perfect combination because differential privacy requires injecting more randomness to guarantee the privacy and with these methods we can quantify the impact of that randomness in the final result.

Going further back and really technical, things that stand-out are using natural gradients in variational inference that can really speed-up learning, and has led to significant later breakthroughs in stochastic variational inference and Bayesian deep learning.

A small but significant technical breakthrough that enabled a few major papers, but did not make the headlines, was a method for expressing the computations by using numerically stable evaluation of differences of so-called error functions. These come up in operations with the Gaussian distribution, and recently came up even in some differential privacy work. My original MATLAB code has now been ported to many other languages.

How are using NVIDIA technologies for your research?

GPUs make training large machine learning models a lot faster and we use NVIDIA V100 and A100 GPUs extensively in my group. I really wish such tools would have been available when I was doing my PhD in the early 2000s using weeks to train neural networks.

Training models under differential privacy has caused some problems here, because it needs access to per-example gradients that standard deep learning frameworks do not support efficiently. I am really happy about the great collaboration we had with the NVIDIA AI Technology Center in Helsinki who helped make our differentially private probabilistic programming code run really fast on NVIDIA GPUs.

What is next for your research?

I have two big goals at the moment: developing new methods to allow doing machine learning and Bayesian inference better under differential privacy, and bringing these to users in open source tools that integrate nicely with their existing workflows and run efficiently.

Any advice for new researchers, especially to those who are inspired and motivated by your work?

Lasting scientific contributions arise from rigorous work built on a solid foundation. There are a lot of components out there that tempt you to try some quick hacks for a quick result, but these very seldom lead to lasting results. This is especially true in fields like privacy, where mathematically rigorous privacy proofs are essential, and seemingly minor details may break the proof for some otherwise attractive combination of methods.

To learn more about the work that Antti Honkela and his group is doing, visit his academia webpage.

Discuss (0)

Tags