Continuous Development of our Mobility Platform
April 22, 2020 | Technical Blog
I remember our early production deployments at Bestmile, back in 2016. There were three main software components at the time. The “core” component — holding the seed of Bestmile’s Fleet Orchestration Platform; our homemade API Gateway — holding the plumbing around the core; and finally, the Dashboard web front-end.
This early version of the platform was not serving thousands of customers. Yet, the deployment felt uncomfortable. I remember these deployments for two reasons:
These were the early days and since then, Bestmile’s platform grew beyond three components (30+ as of Q1 2020). While the complexity increased, Bestmile’s customers’ operations were growing in size and requirements:
Continuous Development is one of those “non-functional” product requirements that are not visible, yet unavoidable to keep the platform running 24/7, while at the same time reducing the time for new features to come to life.
Continuous Development is an umbrella term usually including both Continuous Integration (CI) and Continuous Deployment (CD), as well as the processes (human and automated) around specifying, delivering, and operating software continuously.
CI is the technical process where software code is built, tested and validated automatically. It gives software engineers feedback on the quality of their development.
CD usually happens after CI and is the automated or semi-automated process of continuously pushing the result of the CI process to environments where the software can be operated — eventually to a production environment used by customers.
At first, for someone not used to software development lifecycles, it might sound counter-intuitive to push code continuously to production in a system that is critical to daily transport operations. That means the product would be continuously changing, though not always visually changing. Still, the advantages of Continuous Development exceed the drawbacks.
For business teams, Continuous Development drastically reduces the time-to-market of new features and fixes. From specification to delivery, time can be reduced from months to days.
For product teams, more frequent and smaller changes mean quicker and more focused feedback from internal stakeholders and from customers. It also helps with innovation, when we don’t always know in advance what will work well and need to experiment in a real production setup. In a changing mobility market, that is gold.
For engineers, making sure production systems are always up and running leads to smaller changes in the code. Smaller changes are more stable and controllable. Smaller changes also mean easier rollbacks when something goes wrong. With a Continuous Development mindset (a.k.a. DevOps), developers are also owners of production systems. They see the result of their work sooner and can adapt faster.
Quality
Continuous Development without a good testing strategy will lead to catastrophes in production. With the testing strategy becoming central, the role of the QA Engineer changes drastically in this setup. Instead of being a gatekeeper, the QA Engineer becomes a coach and a developer focused on quality. Automated test-suites become part of the product and cover the whole pyramid of tests, as “manual testing” becomes the exception in a Continuous Development setup. That is a significant effort distributed among developers and testers.
Tooling
Automated tests, automated deployments, and monitoring are essential to making CD work. Bringing in new tools means supporting the infrastructure for them, or paying third parties. This costs both money and effort upfront.
In the old days, each “role” had its own time in the process: Design, Architecture & Security, Development, Testing (QA, including Perf and Security), and Deployment.
With a Continuous Development process, all these roles are merging, and the associated responsibilities fall to all developers.
At Bestmile, the QA role transformed from a “pre-deployment-gateway” role to a “quality strategist” role. Testing is now the responsibility of each developer and starts at the Unit Test level (bottom of the pyramid). The QA Engineer is here to identify the bigger risks, catch quality gaps as early as the specification process, and own the testing infrastructure and automation scripts.
Regarding deployments, each engineer has the right, duty, and responsibility to perform deployments in production. There is no deployment group.
Continuous Development is not only about the tools. It’s also the mindset. In the context of a startup, with very scarce resources (money and time), and with engineers who need to take care of a wide breadth of features and other non-functional requirements, Continuous Development becomes “one of the many things” on the table.
Transforming the way code is deployed requires the right mindset, teamwork, and specific cross-team processes that need to evolve with the product.
There is never one point in time where someone could say at Bestmile: “That’s it, we have Continuous Development set up and working”. Instead, it has become one of those non-functional platform features that we keep evolving along with the product.
The first step for this transformation is the mindset of the engineering team, and of all other actors such as product managers and customer-facing teams.
First rule of the Continuous Development club: do not talk about it… err. Actually, you need to talk about it early, and all the time. The goal is to make not only the engineers but also the larger team understand how code changes are going to hit production, and what that means regarding SLAs, quality, impact on customers, documentation, and communication.
Below are a few rules we set that are part of every development effort. Spoiler: they are pretty standard in the software industry.
#1 — All developers publish their code in production
There is no handover to any dedicated “Deployment” team. By publishing to production, developers are more aware of the conditions necessary for their code to work. They are also in the best position to support any following issues.
By having the responsibility and the ownership to deploy to production, every developer will also make sure that they factor the right level of quality into their work. They have a very good incentive to block development work that would jeopardise production stability.
#2 — Feature promotion: have a clear and fixed promotion path from the developer’s machine to the production environment, even for hotfixes
To be able to automate testing, and to build predictability about when a feature reaches production, it is necessary to define a clear path. When is the feature integrated? When is it tested, and in which test phase? How do we roll back and hotfix, and what is the impact on the deployment process? Bypassing these promotion paths puts quality at risk.
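To make the idea of a fixed path concrete, here is a minimal sketch. It is an illustration only, not our actual tooling: the stage names and the `promote` helper are hypothetical. The point it shows is that a build can only move to the next stage, and only when the tests of its current stage have passed, with no shortcut for hotfixes.

```python
from typing import Optional

# Hypothetical stage names; the real path is whatever the team defines.
PROMOTION_PATH = ("dev", "integration", "staging", "production")


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None at the end of the path."""
    index = PROMOTION_PATH.index(current)  # raises ValueError for an unknown stage
    if index + 1 < len(PROMOTION_PATH):
        return PROMOTION_PATH[index + 1]
    return None


def promote(build: str, current: str, target: str, tests_passed: bool) -> str:
    """Move a build one stage forward; refuse skipped stages and failed tests."""
    if target != next_stage(current):
        raise ValueError(f"{build}: cannot promote from {current} to {target}")
    if not tests_passed:
        raise RuntimeError(f"{build}: tests failed in {current}, promotion blocked")
    return target
```

With this sketch, `promote("hotfix-1", "dev", "integration", True)` succeeds, while trying to jump a hotfix straight from `"dev"` to `"production"` raises an error.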
At Bestmile, feature promotion is done automatically, based on a GitOps process (see below), and looks like this:
#3 — Backward compatible by default
This is probably one of the most important rules, and also one of the costliest. When changes are applied to the code, developers need to identify any breaking change, not only in function signatures (e.g. a REST API) but also in behavior.
At Bestmile, as development is happening fast, endpoints are all versioned independently. When a breaking change is necessary, the endpoint’s version is usually “bumped”, and compatibility with earlier versions is guaranteed as much as possible.
As a rule of thumb:
There are some corner cases, but all developers at Bestmile are aware of and follow backward compatibility principles.
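As an illustration of independently versioned endpoints, here is a minimal sketch. Everything in it is an assumption made for the example: the `vehicle-position` endpoint, its payload, and the tiny `route`/`dispatch` registry are hypothetical and do not describe our API. It shows a breaking change landing in a new version while the previous version keeps answering in its old shape.

```python
from typing import Callable

Handler = Callable[[dict], dict]

# Maps (endpoint name, version) to a handler; each endpoint is versioned on its own.
ROUTES: dict[tuple[str, int], Handler] = {}


def route(endpoint: str, version: int) -> Callable[[Handler], Handler]:
    """Register a handler for one version of one endpoint."""
    def decorator(handler: Handler) -> Handler:
        ROUTES[(endpoint, version)] = handler
        return handler
    return decorator


@route("vehicle-position", 2)
def vehicle_position_v2(request: dict) -> dict:
    # Hypothetical breaking change in v2: coordinates moved into a nested object.
    return {"vehicle": request["vehicle_id"], "position": {"lat": 46.52, "lon": 6.63}}


@route("vehicle-position", 1)
def vehicle_position_v1(request: dict) -> dict:
    # v1 stays available: it adapts the v2 result back to the old flat shape.
    current = vehicle_position_v2(request)
    position = current["position"]
    return {"vehicle": current["vehicle"], "lat": position["lat"], "lon": position["lon"]}


def dispatch(endpoint: str, version: int, request: dict) -> dict:
    """Call the handler registered for this endpoint and version."""
    handler = ROUTES.get((endpoint, version))
    if handler is None:
        raise LookupError(f"{endpoint} v{version} is not supported")
    return handler(request)
```

Clients still calling version 1 get the flat `lat`/`lon` fields they always had, while new clients can opt into version 2. Keeping the old handler as a thin adapter over the new one is one possible design; it avoids maintaining two copies of the logic.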
#4 — Dissociate deployments (to production) from deliveries (to clients)
Continuous Deployment of code means that the code is reaching production before the full readiness of an end-to-end feature. It’s especially true in bigger, more complex systems.
Some features might require multiple sprints to be fully ready for a customer to use. In a more traditional deployment format, features used to be ready-to-use when the code was deployed to production.
With a Continuous Deployment approach, some components might be ready sprints, if not months, in advance, while the full feature is not yet usable.
#5 — Feature flags
Feature flags are a mechanism that allows the user (or an admin) to enable/disable the usage of the feature for customers.
Linked to rule #4 above, this allows us to orchestrate the development of a feature at different speeds between teams. That way, even if the full feature is not available yet, we can already start using part of it, or we can enable it earlier for a few preselected, trusted customers. This allows faster feedback loops. The drawback is that you need to account for these flags in advance, and document them.
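A feature flag can be as simple as the sketch below. It is an assumption-laden illustration, not our implementation: the `FeatureFlag` and `FlagRegistry` classes, the `ride-pooling` flag, and the customer identifier are all hypothetical. It shows code that is deployed for everyone but only active for the customers a flag has been enabled for.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """A flag that is off by default and can be enabled globally or per customer."""
    name: str
    description: str  # mandatory, since flags need to be documented
    enabled_for_all: bool = False
    enabled_customers: set[str] = field(default_factory=set)

    def is_enabled(self, customer_id: str) -> bool:
        return self.enabled_for_all or customer_id in self.enabled_customers


class FlagRegistry:
    def __init__(self) -> None:
        self._flags: dict[str, FeatureFlag] = {}

    def register(self, flag: FeatureFlag) -> None:
        self._flags[flag.name] = flag

    def is_enabled(self, name: str, customer_id: str) -> bool:
        # Unknown flags count as disabled, so unreleased code paths stay dark.
        flag = self._flags.get(name)
        return flag is not None and flag.is_enabled(customer_id)


registry = FlagRegistry()
registry.register(FeatureFlag(
    name="ride-pooling",  # hypothetical feature name
    description="Pool several bookings into one trip.",
    enabled_customers={"trusted-pilot-customer"},
))


def plan_trip(customer_id: str) -> str:
    """Pick the code path for this customer based on the flag."""
    if registry.is_enabled("ride-pooling", customer_id):
        return "pooled"
    return "single"
```

Here `plan_trip("trusted-pilot-customer")` takes the new path while every other customer keeps the existing behavior, which is what lets deployment and delivery happen at different times.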
Customer-facing teams need to understand the orchestration of delivery vs. deployment to know when to communicate what to customers. At the other end, we found that product management teams have a big role to play in that orchestration.
Some leading companies in the software field call this the BusProdDevOps (Business+Product+Development+Operations) mentality, an extension of what is known in software as DevOps (Development+Operations).
This highlights the need for business and product to be aligned with how the software is built.
The details of what works for Bestmile at the moment are going to be shared in a separate article. Below is a summary (and spoiler!).
Bestmile leverages Kubernetes and Docker container technologies to orchestrate zero-downtime Continuous Deployments using out-of-the-box rolling update strategies.
But installing Kubernetes is only the first step. We had to prepare all our services to support this kind of deployment, and shape the tools that would fit our team processes.
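For readers unfamiliar with rolling updates, the toy simulation below illustrates the principle. It is not how Kubernetes is implemented, and the `rolling_update` function and its `is_ready` callback are hypothetical names for this example. It assumes the most conservative setting, where a replacement must be ready before an old instance is retired, so serving capacity never drops.

```python
from typing import Callable


def rolling_update(
    pods: list[str],
    new_version: str,
    is_ready: Callable[[str], bool],
) -> tuple[list[str], int]:
    """Replace pods one at a time.

    Returns the final list of pods and the lowest number of pods that were
    serving at the end of any step of the update.
    """
    running = list(pods)
    lowest_serving = len(running)
    for _ in range(len(pods)):
        # Start the replacement while the old pod keeps serving.
        if not is_ready(new_version):
            # Readiness failed: stop here, the remaining old pods stay untouched.
            break
        running.append(new_version)
        # Only now retire the oldest pod, so capacity never drops below the start.
        running.pop(0)
        lowest_serving = min(lowest_serving, len(running))
    return running, lowest_serving
```

Calling `rolling_update(["v1", "v1", "v1"], "v2", lambda version: True)` ends with three `"v2"` pods and never fewer than three serving. If the new version never becomes ready, the old pods are simply left in place, which is why services need a reliable readiness signal for this kind of deployment to work.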
Stay tuned for more details of how we use and orchestrate those technologies in the follow-up part of this article.
The last point proves to be wrong, as smaller incremental deployments tend to break less, are more stable, are easier to roll back, bring feedback sooner, and are also easier to fix (in case a rollback is not an option).
Stability keeps improving as engineers get better at deployment, build habits and tools, and spread the mindset, skills, and processes around this deployment practice.
Issues still happen, but they are less critical, less frequent (or at least not more), and last for a shorter amount of time.
As of the publication of this article our Platform and Dashboard now totals