The Future Is Federated (Learning)

The potential applications of machine learning for earlier disease detection were among the first things that drew me to federated learning and ML. When I was in high school, my dad found out he had kidney cancer. He only received his diagnosis after a few years of what appeared to be unrelated organ shutdowns and visits to many, many doctors who had no clue what was happening.

The doctor who did make the diagnosis had access to a searchable repository that tied a few repeated cases of renal cell carcinoma (RCC) to these seemingly random shutdowns, which left me thinking about why he hadn’t been able to access that information sooner.

This drew me into the privacy-preserving machine learning space. When I was 19, I applied to the INOVA hospitals accelerator to research potential applications of ML and to better understand biomarkers like miRNA concentration for better RCC diagnosis. As I came to understand HIPAA and the broader, complex regulatory space medicine was stuck in, I began to research federated learning specifically.

In 2017, Google published this research detailing how it uses federated learning for Gboard (Apple followed suit shortly after). Its primary use is effectively training local models on search queries without sending the entirety of users’ personal data back to Google’s servers. This piece pulled me deeper into the space.

A phone personalizing the model locally, based on your usage (A). Many users’ updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.

A Brief Explanation

In short, federated learning is one of many types of privacy-preserving machine learning. This approach specifically enables users’ personal data to stay on their devices (or, in other use cases, enables data to stay on servers) while the model is trained locally. Only the model update is then sent to the cloud. Federated learning offers the potential of machine learning together with the benefits of distributing power (and data ownership) among users. For a broader overview of the variety of types, check out this explainer by OpenMined.
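To make the loop described above concrete, here is a minimal sketch of federated averaging on a toy linear model, written in plain Python so it has no dependencies. Everything in it is an illustrative assumption rather than Google's implementation: the function names (`local_update`, `federated_round`, `make_client`), the learning rate, and the synthetic data are all hypothetical. The point is only that raw data stays inside each client, and the server sees nothing but weights.

```python
import random

def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def local_update(global_w, data, lr=0.1, epochs=5):
    """Train on one client's private data; only the weights leave the device."""
    w = list(global_w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)  # gradient of mean squared error
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def federated_round(global_w, clients):
    """Server step: average client weights, weighted by local dataset size."""
    total = sum(len(data) for data in clients)
    new_w = [0.0] * len(global_w)
    for data in clients:
        local_w = local_update(global_w, data)
        for i, wi in enumerate(local_w):
            new_w[i] += wi * len(data) / total
    return new_w

def make_client(n, true_w, rng):
    """Hypothetical client: n noisy samples that never leave this list."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in true_w]
        y = predict(true_w, x) + rng.gauss(0, 0.01)
        data.append((x, y))
    return data

rng = random.Random(0)
true_w = [2.0, -1.0]
clients = [make_client(n, true_w, rng) for n in (20, 50, 30)]

global_w = [0.0, 0.0]
for _ in range(30):  # repeat the train-locally, aggregate-centrally procedure
    global_w = federated_round(global_w, clients)
print(global_w)
```

In this toy setup the shared weights end up close to `true_w` even though the server never touches a single training example. Real systems add a lot on top of this (secure aggregation, client sampling, compression), none of which is shown here.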

How Then Shall We Live?

Questions of data privacy, who owns one’s data, and the power that comes with that responsibility are more relevant than ever. We believe that now is the right time to build an enterprise company in this space, especially as the research surrounding FL has matured.

Our Approach

After the 2017 Google paper piqued our interest, we spent the next few years meeting with operators and researchers across data privacy and privacy-preserving machine learning broadly. As we began to build our thesis on both the application and timing of the space, we started to see the first wave of federated learning companies pop up. What we saw, however, was that many would initially align with our thesis on the diverse potential customer base for this technology, but would eventually end up in a narrow scope of fraud detection. Not a bad thing, but perhaps a sign that it’s still too early for the horizontal opportunity we thought was here.

(For broader notes on the future of compute architecture, see @mhdempsey’s “What kills Cloud Computing: A history of time shared computers and one device to rule them all”)

I like to daydream about the romantic ideals of the information structures of the future by examining the past. This is the ancient library of Alexandria, one of the largest libraries in the world.

Where the Future Lies

What makes us excited about this space is the number of critical industries that have understandably been unable to adopt machine learning because of privacy concerns and sensitive data. Whether it’s because of regulation (Europe’s GDPR and California’s CCPA), technical limitations, or concerns from stakeholders, there are many limiting challenges front of mind.

We think there is a lot of potential in sending forward-deployed engineers out for the first year or two to gain an internal understanding of the highly complex teams typical of government, pharma, and banking, and to accelerate product-market fit (not dissimilar to the way Palantir approached working with the government). Having a dedicated engineer think through how to architect your application of federated learning is a real administrative burden, so the industries best suited for FL are those whose regulatory or other privacy-related burden is high enough that they are both economically and structurally motivated to spend time implementing it. We see finance, pharma, and government as the likely first movers in this space, with a long tail of possibilities across healthcare and other industries.

Who is Taking on the Challenge?

Currently, Doc.AI and Owkin are using FL with the intention of implementing cross-device FL for medical research, and Intel has focused on FL for medical imaging specifically. This piece lays out a simple framework for federated learning on vessel segmentation, if you want to try it out! This EU-funded paper and research details the potential of FL for drug discovery virtualization.

Musketeer is pushing forward use cases in smart manufacturing and medicine. Nvidia’s Clara is a reference application for distributed AI training that’s designed to run on Nvidia’s recently announced EGX edge computing platform. Additionally, FedAI, Devron, Decentriq, and DataFleets are all focused on developing general enterprise federated learning platforms and frameworks.

Constraints, Challenges, and Open Questions

There are many different types of federated learning, and we’re excited to keep reading as the space solidifies and implementations become more popular. (We’re often looking for teams in this space that are post-academia or spinning out of a research group, so it’s always exciting to receive white papers about new research that they’re implementing at their company.) We came away with a few core questions about the challenges of the space:

  • What unique challenges do the constraints of the devices the model is trained on present? With cross-device federated learning, the devices that gather the data must also be able to train a model. There are also unique challenges around the varying fidelity of the data that different devices collect, and around the speed at which they each train the model, since they may need to send their updates to the cloud at the same time.**
  • What new, and likely under-researched, security risks do FL systems present? One of the open questions in this space is the potential to reverse engineer details about personal data from the model update that is sent to the cloud. A Sybil attack, for example, also represents some risk for FL. We’ll continue to follow along as the security research progresses in this space.
  • What level of parallel computing is possible? Current algorithms only work with device numbers in the hundreds. Hopefully this number will grow as the algorithms progress.
  • How do we deal with non-IID data? The traditional statistical assumptions made with many ML models (i.e., that the data is independently and identically distributed) don’t always hold for federated learning. So how we account for this in the ideal use cases is something we’re still thinking about. (This piece by DataFleets, a privacy-preserving data engine, gives a great illustrative example of non-IID data if you’re not familiar.) Edgify, for example, has proposed federated curvature. This adds a penalty term to the loss function, compelling all local models to converge to a shared optimum.
  • What more is possible for federated computing, outside of just machine learning?
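As a rough illustration of the penalty-term idea mentioned in the non-IID question above, here is a simplified sketch in plain Python. It is not Edgify's exact formulation: it assumes a quadratic penalty that pulls a client's weights toward the weights other clients reported, scaled by a per-parameter importance score. The names `curvature_penalty`, `penalized_loss`, `other_importance`, and `lam` are all hypothetical.

```python
def curvature_penalty(w, other_params, other_importance, lam=1.0):
    """Quadratic penalty pulling w toward other clients' parameters.

    other_params: list of weight vectors reported by other clients.
    other_importance: matching list of per-parameter importance scores
    (an assumed stand-in for curvature information).
    """
    penalty = 0.0
    for w_j, imp_j in zip(other_params, other_importance):
        penalty += sum(f * (a - b) ** 2 for f, a, b in zip(imp_j, w, w_j))
    return lam * penalty

def penalized_loss(w, data, other_params, other_importance, lam=1.0):
    """Local mean squared error plus the shared-optimum penalty."""
    mse = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data
    ) / len(data)
    return mse + curvature_penalty(w, other_params, other_importance, lam)

# Toy usage: one local sample, two other clients.
w = [1.0, 2.0]
data = [([1.0, 0.0], 1.0)]            # local model fits this sample exactly
others = [[0.0, 2.0], [1.0, 0.0]]      # weights reported by other clients
importance = [[2.0, 1.0], [1.0, 0.5]]  # how much each client "cares" per weight
print(penalized_loss(w, data, others, importance, lam=0.5))
```

The design intuition is that a client with skewed local data can still fit it, but pays a cost for drifting away on the parameters that matter most to everyone else. When a client's weights already match the others', the penalty is zero and the loss reduces to the ordinary local loss.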

In light of all this, it’s clear that privacy-preserving ML, and federated learning especially, are a core part of the future we believe in, and we’re excited to play a part in it.

As always — feel free to tweet or message me questions, thoughts, disagreements, or pitches on twitter or at


** There’s a movement to better understand the tradeoffs around communication costs, since end users’ internet connections typically operate at lower rates (Yuchen Zhang, John Duchi, Michael I. Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013).