The sacrifices needed to be a “Machine Learning-first” company

The inevitable confrontation between software development best practices and the brave new world of ML

From Narrow To General AI
12 min read · Nov 16, 2024

By Yervant Kulbashian. You can support me on Patreon here.

Software development is hitting an inflection point. As Machine Learning (ML) becomes a viable approach for solving difficult real-world problems, companies are retooling their software stacks to put Machine Learning Operations (MLOps) front and center, as well as increasing their emphasis and reliance on data. What hasn’t happened, however, are the concomitant changes in human and organizational processes that are necessary for the technology to succeed. So-called “ML-first” teams still function as before, using old SDLC¹ wisdom, including the classic segregation of responsibilities that has so far served traditional software development well. But Machine learning is a different beast. Companies that wish to be successful must better understand its peculiar requirements to take advantage of this new opportunity — including what is possible, and what is no longer viable.

Machine Learning and traditional software both have their respective strengths and weaknesses; traditional software organizations have been structured to take full advantage of the strengths of software as an enabling tool, as well as to limit the impact of its many shortcomings. The Agile movement is one such organizational philosophy. The same is true of the design of software systems themselves. The many paradigms, architectures, design patterns, and their accompanying component abstractions and best practices have all evolved to make it easier for developers to understand the various pieces of the system and their interconnections — e.g. “this is the service you call for authentication, this is the one that hits the Credit Card rails during checkout”, and so on. They are also designed to make the right kinds of changes easy — usually at the expense of other, less important ones. For example, it is easy to add a hundred new users to a website, but more difficult to redefine what a user data-object entails. Even the fundamental architectural assumptions of software development, such as the use of databases, the server-client division, or distributed systems, are all intended to optimize certain transactions that have business value. Most importantly, they aim to give developers direct control over the various domain concepts they employ (e.g. users, services, tasks, data streams, etc.) and which they must be ready to change in accordance with the needs of the business.

Machine learning upends many of these assumptions. Whereas in classic software a developer tries to maintain fine-grained control over the details of implementation — through a hierarchy of abstractions down to the smallest nuances — ML models are increasingly being given control over those implementations. Where earlier the control of a robot arm’s trajectory was a finely tuned art and field of study, ML models now frequently take over the entire responsibility of optimizing motion trajectories. Similarly, pick-and-place robots used to be composed of many small scripts and optimizations; now a Behavioural Cloning model can handle the entire process of selection and grasping without interference. With ML, the developer is no longer allowed to stick their finger into the minutest divisions, to define their own ontology of how processes, tasks, and responsibilities are split up. Now the model is granted that decision-making power, with the hope that it will make the “right” decisions when the time comes.

it is difficult to enforce strict abstraction boundaries for machine learning systems by prescribing specific intended behavior. Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. — Hidden Technical Debt in Machine Learning Systems

This shift in responsibility was noted early on in the development of the field. For example, in the early 2000s researchers moved from static, grammar-based language translation systems to fully data-driven ones, such as the service Google has since popularized. The latter are allowed to discover, of their own accord, the subtle patterns in language use which constitute “grammar”, and at the same time to bend those rules as needed — e.g. patois, idioms, slang — and even to learn the nuances of when one rule must take precedence over another (e.g. whether to split infinitives). This approach lets the system reflect the fluid nature of ever-evolving languages as they manifest in real writing. The same philosophy had earlier prompted Frederick Jelinek’s infamous remark about speech recognition:

Every time I fire a linguist, the performance of the speech recognizer goes up.

Whenever we impose our own “expert” understanding (e.g. of grammar) onto a system we may, in a certain sense, be giving it an advantage or a leg-up, since it no longer has to learn or find those patterns on its own. But we also constrain it from breaking or going beyond those rules, leading to a long tail of exceptional failure cases which the model is no longer able — or even willing — to address. This is the trade-off we must make: to give up some part of our control. In contrast, pure software exerts control over the system in the most direct way possible — you tell it exactly what to do and when. In ML, however, control must be exerted through indirect means: through the selection and processing of training data, through the introduction of new sources of data input, through hyper-parameter tuning, through hand-crafted annotations, and even through the choice of when to retrain a model and when to let a failure case slide. The type or architecture of the model — though less important than many people think — is also part of this regimen of influence.
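To make this regimen of indirect influence concrete, here is a rough sketch of what those levers might look like when gathered into a single training configuration. Every field name and threshold below is an invented assumption for illustration, not a real API or a recommendation:

```python
# Hypothetical sketch only: the fields stand for the indirect levers described
# above. Influence is exerted through choices about data and training, not
# through logic that dictates the model's behaviour directly.
from dataclasses import dataclass, field

@dataclass
class TrainingRunConfig:
    data_sources: list = field(default_factory=lambda: ["prod_logs_q3"])  # which data to train on
    excluded_tags: tuple = ("sensor_glitch",)       # data selection and filtering as a steering lever
    min_annotation_agreement: float = 0.8           # which hand-crafted labels to trust
    learning_rate: float = 3e-4                     # hyper-parameter tuning
    retrain_if_error_rate_above: float = 0.05       # when to retrain vs. let a failure case slide
```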

Through these levers we push the model towards the kinds of behaviour and results we want for a given task. One consequence of this indirect approach is that it becomes critical for engineers to have deeper domain knowledge. Without exception, the models and ML projects that are successful are the ones where the team building them fully understands the nuances and intricacies of the task environment, and tailors everything — inference, hyper-parameters, training systems, data collection — to fit it. No single approach to Machine Learning currently works for all tasks. For example, in most robotics implementations millisecond-level inference latency can be a critical factor for success, whereas in the space of web product recommendations, such latency is relatively unimportant. The data collected and the training employed must also be fitted to the needs of the domain: e.g. manually-created annotations must target those features and factors that will drive the model to “notice” the right things.

None of this expertise can be easily encoded in documentation or even in the code itself. Traditional software has the advantage of making explicit — through method names, variables, unit tests, classes, etc. — what the system is supposed to do, and embeds that “in concrete terms” for posterity. But in ML this knowledge is implicit in the choices of what to include and what to exclude, or in the focus on certain processes and the neglect of others. Every new change may alter the system in unexpected ways, and given the lack of transparency into the black box, only a comprehensive understanding of the model’s nuances can manage that complexity.

Teams must exhibit a certain amount of in-flight adaptability with respect to requirements. The unpredictable vicissitudes of ML models, obscured as they are within the black box of a neural network, make it impossible to predict deterministically where the most common failure cases will appear, and therefore what will have to be compensated for. The steps that will be taken to improve a model’s performance can rarely be known beforehand, and must usually be discovered during the process of training and testing. This is frustrating for developers and businesses alike, since it makes project timelines difficult to estimate. Such is the nature of the technology². As with a human child, you cannot know how quickly, if ever, it will master a task. If you want to force predictability onto development cycles, you can only do so by imposing direct behavioural control on the system through scripted routines — at the expense of allowing the system to learn how to deal with those situations itself.

Ultimately, the knowledge of how to run and optimize the ML system ends up stored largely not in concrete code or documentation, but in the minds of the team of developers. If they are to move quickly, and adapt on a daily basis to exceptional or edge cases, they must function as a cohesive whole, where everyone understands the system end-to-end, and knows how their choices in one situation will influence the overall performance of the system. The same might be said about software development in general; in software, however, the abstractions (e.g. OS, network, database, service, threads, etc.) hide their underlying complexity and present a simplified interface, allowing the individual developer a semblance of understanding the whole without actually having to do so. Where those abstractions can no longer be relied on or implemented — as is the case in an ML model that “does everything” under the hood — knowledge of the system is no longer knowledge of its component abstractions, but of how the model as a whole will function in its native environment. The model and its performance have become the abstraction, and little more can be gleaned by digging inside.

What does this mean for the structure of teams and their day-to-day activities? The traditional division of work into areas of focus and responsibility (data engineering, payments, etc.), each with clear interfaces dividing the work of one team from others, is no longer feasible. There are no “divisions” inside a model except those which you ham-fistedly impose at the expense of letting the model learn. Like a toddler, whose brain you can’t manipulate directly, you can only guide a model in the right direction through experiences (data) and incentives (training/rewards). Multiple people can still work on a single model, of course, but not through team boundaries and interfaces, since those don’t exist and can’t be created except to the detriment of the product. Alternatively, several separate ML models can be designed to interact and mutually support one another, but again, this is an enforced division based on what we as developers deem to be optimal, and it deprives the model of the opportunity to learn what is necessary for the interaction on its own. You either trust the model to make a decision or you don’t, and where you don’t — where you hoard the decision-making for yourself — the model cannot optimize.

The ideal team, therefore, is a small cohesive unit, whose understanding of a model and its environment is thorough and complete. The project, similarly, cannot be an activity split across stages or teams, but a unified endeavour of that group, from start to finish. The uncomfortable truth is that a model lives and dies with its team (most business organizations would refuse to accept this compromise on principle alone). A model cannot be improved by splitting it up among more and more teams, as is customary in software development. In the latter, whenever the system becomes too complex for one person or team to handle, it is divided into pieces and spread across departments through carefully crafted interfaces. This is not an option for a cohesive ML model. Any alterations to the model, such as introducing new predictive heads or new sources of training data and inputs, will, if they are to be effective, necessarily influence the performance of the model as a whole. And even though there are ways of segregating a model’s training regimen, such as dividing it into a backbone and heads, this approach once again hinders the success of any given team, who now find they are not able to eke out the performance gains they otherwise could.

Relying on the combination [of models] creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components. — Hidden Technical Debt in Machine Learning Systems
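To make the backbone-and-heads arrangement concrete, here is a minimal PyTorch-style sketch; the layer sizes and head names are assumptions invented for illustration. Freezing the shared backbone is precisely what isolates one team’s training from another’s, and also what caps the gains either team can eke out:

```python
# Illustrative sketch (assumed layer sizes and head names) of the backbone/heads
# split: team isolation is bought at the cost of end-to-end optimization.
import torch.nn as nn

backbone = nn.Sequential(            # shared feature extractor
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
for p in backbone.parameters():
    p.requires_grad = False          # frozen: one team's gradients can never reshape
                                     # the features another team depends on

grasp_head = nn.Linear(256, 6)       # one team's output head
defect_head = nn.Linear(256, 2)      # another team's output head

def grasp_logits(x):
    # Grasping errors can only adjust this head, never the shared representation,
    # so the grasping team cannot eke out gains that would require better features.
    return grasp_head(backbone(x))
```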

And finally, the long-term lifecycle of a model in production must reflect the peculiar needs of developing and deploying ML solutions. As mentioned above, you cannot easily estimate the duration of a project, at least not as accurately as with traditional software. So timelines must be adjusted to account for this variability. If the technology is to be a core part of your business, your business must be willing to accommodate this unpredictability, otherwise it risks clamping down on both the team and the promise of the technology. The more control you try to impose from the outside, the more likely the project will be forced into a corner where it under-performs.

Once a model has been developed, the deployment and monitoring steps must allow for unpredictable performance outcomes. Nevertheless — or perhaps because of this — the long term responsibility for maintaining the system must remain with the team itself, and cannot be passed onto some operations or QA division. The team are the only ones with the domain knowledge for how to deal with emergent situations as you scale, and what the trade-offs will be; and the ongoing care for the model cannot be juggled from foster team to foster team, who will inevitably end up making the same mistakes over and over, disrupting the growth of the system.

A hybrid research approach where engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly. — Hidden Technical Debt in Machine Learning Systems

Even with continual improvements and retraining, every model will inevitably hit its limits, and may need to be replaced with a better one. That can be a new project for a new team, who must be given the freedom to define their own structural assumptions and optimal training regimens. Other teams can run their own experiments and train their models in parallel, without needing to rely on the original team. As new models show performance gains over existing ones, the latter can be retired or replaced wholesale. This leap-frogging process is in contrast to the piecemeal improvements common to software systems, but it is necessary, since you can only improve a system piecemeal if you can break it down into pieces. I can improve a car by swapping out the engine or brakes; I cannot, however, improve a person by swapping out their pieces with other “better” people’s. I have to accept both people and ML models as cohesive wholes, and this includes the training and data collection systems around them. Excising parts of the latter can only dilute performance.

Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. — Hidden Technical Debt in Machine Learning Systems
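A hedged sketch of that wholesale replacement follows; the `evaluate` helper and the model objects are assumptions standing in for the team’s own evaluation harness. The point is that a model wins or loses as a unit:

```python
# Leap-frog replacement sketch: the challenger either supersedes the champion
# wholesale or is discarded; no "better pieces" of one are grafted onto the other.
def maybe_promote(champion, challenger, eval_set, evaluate, margin=0.01):
    champion_score = evaluate(champion, eval_set)      # team-owned evaluation suite (assumed)
    challenger_score = evaluate(challenger, eval_set)
    if challenger_score > champion_score + margin:     # require a clear win before swapping
        return challenger
    return champion
```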

To be fair here, we don’t need to throw out all of our traditional ideas around software deployment, such as A/B testing and systems monitoring. These are still as helpful as ever. And although the technology is fundamentally new, the business needs that drive its application are still present and mostly the same as before; only the shape of the solution has changed. In practice, many of the surrounding tools built to support an ML project or model may still be reused in new projects; they need not all be built from scratch — as long as teams have the flexibility to adopt and discard what works for them, and potentially build anew³. The critical factor to consider here is the speed of iteration. Anything that slows down or impedes the cycle of training and testing, during which new edge cases are discovered and addressed, cannot be tolerated.
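As an example of how little of that machinery needs to change, a plain traffic split between the incumbent model and a candidate might look like the sketch below; the routing fraction and model handles are assumptions for illustration:

```python
# Minimal A/B-split sketch: send a small fraction of requests to the candidate
# model and record which variant served each one. All names are illustrative.
import random

def serve(request, incumbent_model, candidate_model, candidate_fraction=0.05):
    if random.random() < candidate_fraction:
        return "candidate", candidate_model(request)
    return "incumbent", incumbent_model(request)
```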

Nor can the responsibility for overcoming emergent obstacles be taken off the team; the solutions must remain theirs if they are to be effectively integrated into the model. As Christensen noted in The Innovator’s Solution, in situations where the performance of a system is still below what is required by its customers, the system must be designed in an integrated (non-modular) way in order to optimize efficiency wherever possible. It cannot be split into components with interfaces to one another, as these will regularly need to change to provide what the other components need of them. And certainly ML solutions are not “good enough” at the moment. So ML teams cannot be left waiting on, say, the data team next door to provide new feeds for each of their tentative experiments — of which there may be many every week, and few of which prove to be productive. They must have the flexibility to find and employ what they need quickly. There is a saying:

If you want to go fast, go alone; if you want to go far, go together. — Various origins

And ML teams must go fast. They are both researchers and product developers: an enormous burden to place on one team’s or one person’s shoulders. They are only barely able to meet these challenges, and it takes a strong leader or coder to balance the requirements of both. Software teams are rarely called upon to implement novel and untried technologies at the cutting edge of academic research while also meeting tight product deadlines. But that is the new world of ML-first product development with which we are confronted. A company that is serious about profiting from this emerging set of opportunities must be ready to take the difficult but necessary steps in its organization: to change its thinking about software development life-cycles, to update its definitions of “good” and “bad” practice, of what are traditionally held to be anti-patterns, of conventional road-maps and milestones, and of the meaning of ownership. Otherwise it is doomed to timid and bearish imitation of the real experts, and thus to inevitable failure.

¹ “Software Development Life-Cycle”.

² Many common agile processes are therefore ineffective during ML development, partly due to the unpredictability of estimations, and partly due to the need for engineers to follow their intuition on the fly while experimenting with the data and the system.

³ In my experience, effective teams often build their own tools suited to their needs, for speed and efficiency, for providing the right kind of expert demonstrations, etc. Every model is a dynamic entity, and knowing its subtleties is as much an art and intuition as it is a technique.
