Machine Learning engineering – pragmatic approach to tooling

Hey, I’m Denys and I’m a Machine Learning Engineer & Team Lead. I love building production-grade ML systems and talking about them – you can see some of my past talks here.

For the past 2 years, I’ve been working at Bolt, which was recently voted “Hottest European Unicorn” by TechCrunch. As a recap, Bolt is the leading transportation platform in Europe and Africa, with fast-growing ride-hailing, e-scooter, and food delivery businesses.

This post covers my thinking around building and using ML tooling, and the approaches that served me and the company well on our journey of building a Data Platform.

My Experience

In my time at Bolt I’ve worked on a variety of data-related topics: replicating data into the data warehouse, developing an in-house feature store (for machine learning and beyond), building the first versions of a data lake on top of S3, and developing our ML infrastructure & tooling.

For the past year, I’ve been solely focused on making sure that our growing data teams can be efficient and deliver value rapidly & at scale. I do that by leading a team and writing code that solves both the common challenges of machine learning projects and some rather unique ones. For example, because we operate in real cities, each with its own dynamics, each city needs its own “personal” ML model.

That’s why I think my experience of working on ~10 ML projects, with 10+ data scientists and many more engineers involved in building data products, gives me a perspective on how a pragmatic and practical approach can make ML projects succeed.

// Side note: Bolt is also a company that leverages data and algorithms in every part of the organization. I initially wanted to call it data-driven, but in some circles, this can be considered a d-word 😉

So what have I learned on my journey that engineers and data scientists can learn from?

Build vs Buy (or Rent)

One of the most important decisions you make is what to focus your time on. As an engineer or manager, your decisions in this domain can produce outcomes that can vary by orders of magnitude.

Rule of thumb – the smaller you are, the more you should use existing solutions, open-source software, and technologies. The larger your organization gets, the more opportunities you’ll see to get decent returns on the investment and effort of building custom tech outside of your core business (think Facebook using a custom Mercurial for their repos).

And when you choose to buy (or “rent”, as many tools are sold on a pay-as-you-go basis), try to buy into things with traction, vision, and resources behind them. The next section covers one crucial aspect of making such long-term decisions and how to choose tools.

/* Small note – I consider using free and open-source systems a “buy” decision: you need to invest time to learn & adopt any solution and to write integrations with the rest of your systems, and time is money. */

They say “pick the right tool for the job”. Data and ML teams have so many tools around that picking among them can lead to analysis paralysis. How do you learn about and choose from 200+ MLOps tools? (This post aggregates some insights from researching 200 MLOps tools and is worth checking out: https://huyenchip.com/2020/06/22/mlops.html)

Intelligent App ecosystem, taken from TechCrunch

Pick ecosystems, not tools

There is a big temptation to pick a bit of everything because of minor annoyances like “X can’t do Y as easily as Z”. An example from my practice: we had been running Airflow in the company for almost a year and were considering a platform for scalable retraining of hundreds of city-specific models. One drawback of Airflow for this job was summed up as “Airflow doesn’t allow parametrizing workflows from the UI, and Kubeflow does, so let’s use Kubeflow”. If you don’t resist the urge to act on such arguments, which are reasonable but not big enough, you can easily end up with a zoo of technologies, where every single person on the team has to go through the learning curve of each tool. Moreover, integrating them into a coherent picture for the end user can be a mess.

On the other hand, when you pick an ecosystem or a large platform, the tidal wave of users tends to get such gaps closed. For example, the Airflow drawback I’ve mentioned was solved recently in Airflow 1.10.8 – https://issues.apache.org/jira/browse/AIRFLOW-5843. When I picked Airflow as the tool for scheduling & orchestration of workflows at Bolt, we didn’t have that need yet, and Airflow was still in the incubation stage at the Apache Software Foundation, which meant risks. But the Airflow community was already strong, with other companies capitalizing on it: Astronomer offering managed Airflow (by the way, they have a lot of great posts about Airflow on their blog) and Google building a Google Cloud solution on top of Airflow – https://cloud.google.com/composer.
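To make “parametrizing a workflow” concrete, here is a minimal plain-Python sketch. It is deliberately not Airflow or Kubeflow code: the function name, the default parameters, and the city names are all hypothetical, and it only illustrates the idea of one workflow definition fanned out per city, with run-time overrides taking precedence over defaults.

```python
from typing import Dict, List, Optional

# Hypothetical defaults, for illustration only (not taken from any real pipeline).
DEFAULT_PARAMS = {"lookback_days": 28, "model_type": "gbm"}


def build_retraining_jobs(
    cities: List[str],
    overrides: Optional[Dict[str, dict]] = None,
) -> List[dict]:
    """Expand one workflow definition into one job spec per city."""
    overrides = overrides or {}
    jobs = []
    for city in cities:
        # Per-city overrides supplied at trigger time win over the defaults.
        params = {**DEFAULT_PARAMS, **overrides.get(city, {})}
        jobs.append({"job_id": f"retrain_{city}", "city": city, "params": params})
    return jobs
```

Calling `build_retraining_jobs(["tallinn", "kyiv"], {"kyiv": {"lookback_days": 7}})` yields two job specs, where only the second one deviates from the defaults. Whichever orchestrator you use, the point is the same: one definition plus parameters, instead of hundreds of near-identical copies.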

Although you never know for sure how a platform will develop, there are signs you can see in the community that help minimize the risk of backing the wrong horse.

When deciding whether to integrate many small “tools for the job” or several big platforms that cover a wide range of problems, picking the latter pays off in the long run.

Right timing

With so many ideas floating around about what one should be doing in a machine learning project, it’s easy to slide down the slippery slope of doing things just because “Insert Big-Tech Company Name” is doing them.

I’m guilty of this myself – I was eager to introduce data drift detection (noticing that features look different in live data compared to the test dataset, which hints at changes in the real world and/or bugs in the upstream code that produces the data). I was really impressed by TFX (TensorFlow Extended) tooling, and TensorFlow Data Validation in particular, so I thought we should have it. But as our data science team consists of very practical and impact-driven people, through discussion I saw that we had to focus on other topics first.
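For readers who haven’t met data drift detection, here is a rough standard-library sketch of the idea. It is not how TensorFlow Data Validation works and not something we ran at Bolt. It computes the Population Stability Index (PSI) for a single numeric feature, using bins taken from the reference (test) sample. The function names, the number of bins, and the small floor for empty bins are my assumptions for illustration.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: List[float]) -> List[float]:
    """Share of values in each bin defined by the sorted inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The bin index is the number of edges that are <= v.
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    total = len(values)
    # A small floor (an assumption) avoids log(0) for empty bins.
    return [max(c / total, 1e-6) for c in counts]


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    n_bins: int = 10,
) -> float:
    """PSI between a reference sample and a live sample of one feature."""
    ref_sorted = sorted(reference)
    # Inner bin edges are quantiles of the reference sample.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]
    ref_frac = bin_fractions(reference, edges)
    live_frac = bin_fractions(live, edges)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

Identical samples give a PSI of zero, and the value grows as the live distribution moves away from the reference. Where to put the alert threshold is a judgment call per feature. A real setup also has to cover categorical features, missing values, and schema changes, which is exactly why ready-made tooling is attractive here.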

And in my experience, if you can delay (read – push back & procrastinate) something for long enough, AWS (or others) will build it. That is what happened with ML data monitoring, as well as many other things, which saved us engineering time and allowed us to focus on more impactful projects.

On a closing note:

These days most ML engineers also need to act as architects – view the system as a whole, think long term, build proof-of-concepts and try new tech in a way that hasn’t been as needed for most engineers working in more traditional domains.

This means bigger responsibility, as the company and teams will live with the consequences of your choices today, sometimes for many years. It also means an exciting opportunity to be in a dynamic field where best practices are still being solidified, so you can make a dent in the field either by contributing directly or by growing the community around a technology of your choice by bringing it to your team.

On a philosophical note, I believe that building Data Science and Machine Learning projects today on top of a solid foundation means laying the first bricks of a fair and bright AI-powered future.

If you are passionate about making machine learning great and useful at scale, and want to build new platforms and capabilities in an exciting tech company – we’re hiring! Ping me directly and I can tell you about the opportunities we have.