Open Source bridges the gap between mathematical research and industry
Nina Miolane - Geomstats maintainer
Nina is an assistant professor at the University of California, Santa Barbara, in the Department of Electrical and Computer Engineering. She spends her day-to-day life developing geometric methods that analyze biological shapes for applications in the medical field. As an example, her research studies anatomical shapes of the brain and how they relate to various pathologies.
Three years ago, Nina started an open source project, Geomstats, with many collaborators around the world, including researchers from Inria, the French institute for computer science research, such as the G-statistics team of Xavier Pennec. The idea behind Geomstats came after witnessing a need in the industry. Nina was working at a biomedical imaging startup, Caption Health. She was recruited as a software engineer to analyze images with geometric methods. That’s how she decided to gather computational methods of geometric statistics (if you don’t understand what this means, don’t worry: it won’t affect your understanding of the whole story) into an open-source package that later became Geomstats.
From math to open source code
Was it the first time you contributed to open source?
Yes. I had actually started to look for a tool similar to Geomstats early on during my Ph.D. with Xavier Pennec. I had found a repository and contacted the maintainer. He told me that I was welcome to contribute and to create a pull request. But there was no onboarding document associated with the repo, and to be honest, I didn’t know where to start with GitHub. My Ph.D. was in applied maths, and I had doubts about whether I had enough software engineering skills to contribute properly to an open source project.
After my Ph.D., when I started working as a software engineer, my confidence, and thus my interest in open source, increased. At Caption Health, I learned important software engineering best practices: coding style, continuous integration, unit testing, documentation, etc. I also met Johan Mathe, Principal Engineer, who had a lot of software development experience, having worked at Google for almost 10 years. He was very interested in learning more about geometric statistics. He helped me set up a clean codebase and host a public repository on geometric statistics. That’s how we started Geomstats! I integrated code from my Ph.D. advisor’s team and from published papers on the topic, and new collaborators joined quickly. We built a first version of the package and published a paper introducing it in the Journal of Machine Learning Research.
"Contributing to Geomstats has been a fascinating journey! Geometric Statistics is an elegant framework that helps me formalize the machine learning applications I have encountered in my career. The concepts that we have implemented in the library are helpful in a range of applications from robotics to medical imaging, and also for theoretical deep learning, all of which I have experienced first hand in the industry." Johan Mathe, CTO at Atmo Inc, USA.
Why did you decide to move to open source?
Although the motivation came from my experience in the industry, I realized that researchers in academia could really benefit from Geomstats. During my Ph.D., I had witnessed that researchers in my field were not always sharing their code, or they were sharing code that was not necessarily maintained with rigorous unit-testing. Unfortunately, this would lead the community to re-code the same methods over and over. In mathematics, proving a theorem and building a new theory may be more rewarding than collaborating on code - although I think this is changing. Additionally, a mathematics education does not necessarily include training in software engineering. All in all, there might not be many incentives to share and collaborate on code. Thus, I combined the need to create a mathematical solution for my company with the motivation to increase collaboration among my peers, and open-sourced methods in geometric statistics. Now, anybody can access Geomstats: academics - people who might want to use it or contribute innovative algorithms; and the industry - people who might want to use it for geometric statistical learning.
Fostering a diverse community is challenging
What difficulties did you encounter trying to grow the community?
In the academic world, the turnover of students can be high - especially if the Ph.D. students go to industry after graduating. It can be hard for us to keep continuous expertise on the projects with students only. Although it was not necessarily something that would help me secure an academic job, I started coding Geomstats during my postdoctoral studies. Postdocs or students in my position might have other priorities, and thus not necessarily the time to get heavily involved with open source projects. If they do, they often drop their engagement if they leave academia.
Now, we are trying to expand our reach and attract people from different fields, especially developers from the industry who are willing to get involved in the long run, but everyone is very welcome! We need coders who can help with the codebase’s infrastructure AND mathematicians who want to build the core of the library. To do this, we organized a series of hackathons: one hackathon in-person at Inria in Nice, France; and two online. People from various places of the world joined. It was an excellent opportunity to attract coders from different backgrounds, explain what Geomstats does, and onboard new developers. Nicolas Guigui, Ph.D. student from the G-statistics team at Inria, also led a development project in which several engineers worked for several weeks to enhance Geomstats, which definitely improved the performance of the library.
“Getting involved in Geomstats was the best decision of my Ph.D.! It made me see geometry from a whole different angle, giving me a grasp at all that structure that can be leveraged to learn from data. Not only did Geomstats help me illustrate my research but it also made me interact with many other practitioners from both academia and industry, and it even influenced my research towards a finer understanding of geometric statistics. Most of my papers use Geomstats today, and there is still so much more we can do!” Nicolas Guigui, Ph.D. candidate at Inria, France.
Additionally, we organized a coding challenge at the ICLR 2021 Geometric and Topological Representation Learning workshop. In contrast to traditional machine learning challenges where the goal is to showcase performance on a learning problem, we encouraged participants to provide creative ideas around geometric and topological learning. The submissions were excellent, and we were happy to reward the participants with prizes - also publicly acknowledging their contributions to open-source at an international meeting.
Summer schools are also great opportunities to reach new developers. We applied with a research proposal linked to Geomstats to the Summer Geometry Institute at MIT. We are also considering Google Summer of Code, as I am in contact with another open-source project involved in it. Even with these initiatives, it can be hard to reach out to developers beyond my own network.
What is your onboarding process for newcomers?
A lot of people are interested in contributing but don’t know where or how to start. The contributing files are often too intense for beginners as there is a lot to learn and understand. We often start by sending interested developers our introductory notebooks, so that they can understand the math behind the package. To save explanation time, we also made a video that introduces the package; it was originally presented at the SciPy 2020 conference.
I also try to do pair programming, in person or via video, to explain how GitHub works if the new collaborators are not familiar with it. If they are experienced contributors, I’ll just show them the contributing file and go through it with them. Interested developers often come with a project in mind (for example, a researcher who wants to implement a specific feature). But if they don’t, I go over our GitHub issues to identify a need that matches their skills. Even though I always enjoy interacting with new contributors, this can be time-consuming.
We communicate a lot on GitHub. We try to have a culture where everyone is empowered to say anything they want. We also have a Slack that is a bit more private and enables people to speak freely, share papers, ideas, job openings, etc. This is a great community!
“I initially heard about Geomstats on Twitter from Nina, and I decided to contribute for two reasons. First, rapid prototyping of ideas can lead to incredible amounts of progress in any scientific field. And second, I was eager to learn Differential Geometry by contributing to Geomstats. I had no background in Geometric Machine Learning and Differential Geometry when I started contributing in March 2021. So I started working mainly on infrastructure issues.” Saiteja Utpala, Software Engineer at MasterCard, India.
A fruitful collaboration between mathematicians and engineers
What are you the proudest of?
First, I am very happy to see mathematicians and engineers collaborate and build Geomstats together. By open-sourcing Geomstats, we were able to incentivize more students and researchers to share their code alongside their conference and journal papers. This open source project has already fostered many collaborations, as new researchers will often contact Geomstats’ contributors to get insights on their code. Papers with code integrated into Geomstats thus gain more visibility (and citations). Clean code also makes it easier to reproduce the mathematical results of a given paper, and to check and test them, for example through visualization. By making papers more understandable and reproducible, the community as a whole can save time! I hope the example of Geomstats can serve other math fields beyond differential geometry and create new opportunities for collaborative mathematics.
I am also very excited to see how Geomstats is being used as an educational tool. More and more professors are using Geomstats for interactive classes and hands-on homework. Students can understand new geometric concepts by trying out some code, doing visualizations, testing mathematical formulae... Geomstats is currently used by professors in European universities, and in the US at Stanford and UC Santa Barbara for example. I am really excited by this outcome as I feel Geomstats can serve beyond its primary purpose. Hopefully, this will also motivate students to contribute to open source in mathematics.
Kickstarting change in the industry with open source mathematics
Where do you see Geomstats in the future?
Our goal is to make tools from geometric statistics easily available, both for mathematical research and industrial applications. Geomstats is really a toolbox: I am excited to see how people will choose to use it. I hope to be surprised! So far, I have seen many applications in the biomedical sciences, but other fields are starting to try it out: for example, I was recently contacted by a firm in finance.
Concepts from differential geometry and statistics are also being used in machine learning and deep learning research. Some researchers see differential geometry as a framework that could lead to a unifying theory for understanding deep learning algorithms. Maybe Geomstats will play a role there as well?