While scaling from 150 to over 1300 people in 3 years, Trendyol migrated from monolith to microservice architecture. Our manual processes were breaking, and this required more work from developers. Developer experience began deteriorating, so we had to make a plan. Here’s how we improved developer experience while scaling.
Trendyol is an ecommerce platform that connects millions of buyers to hundreds of thousands of sellers. Because of the large number of transactions and users on Trendyol, it’s imperative that our platform is trustworthy and has a smooth user experience.
We began scaling in 2018. We had a monolithic application with one big datasource and around 100 people contributing to this single application. As our team grew, the number of users in our system grew exponentially. The number of concurrent users went from between 40 000 and 50 000 in 2018 to around 2.5 million in 2021.
While doing this, we launched campaigns during our busiest quarter. We noticed some problems:
- Developers were regularly staying at the office until 2 or 3 AM.
- New servers were being provisioned manually.
- Things generally seemed like a nightmare for everyone.
To scale a single app, we had to manually create instances, do manual deployments and add new machine IPs to a config file. Development took longer, deployments were painful, and monitoring was hard. Developers had to work overtime to keep up with the business needs and technical challenges. This was unsustainable.
While platform stability and security are important concerns, so is developer experience. We wouldn’t have a platform that creates value for our customers without our developers.
What do we mean by developer experience?
Developer experience means happy teams and team members who are empowered to focus on business problems and challenges. Developer experience is about how we feel when we code, deploy and monitor. That’s why we think of developer experience in these three steps.
For us, a typical productive day for a developer should consist of developing with the help of infrastructure products, deploying without worry, and feeling comfortable and safe after deployment because every part of the system is well monitored. Before 2018, instead of focusing on business challenges, our developers spent more time working out how the infrastructure should work and searching for correct, scalable solutions to cross-cutting concerns like logging, caching, and connecting to datasources.
This was detrimental to developer experience, because instead of creating more value, we were trying to preserve system stability.
While deploying, developers had to create instances manually and check each deployment step. At the end, when everything was deployed, developers had to assume everything would work because monitoring was not transparent enough to the teams. It was hard to correlate any fluctuation in the system with a deployment or a specific part of the code. For example, if there was a memory leak or high CPU load, developers had to wait for another team to notice it and inform them. This was inefficient and frustrating.
To improve our processes and developer experience, we started creating tools for easier development, faster and more reliable deployments, and better monitoring of our production environments. To guide our software delivery performance, we kept in mind the four key metrics from the book Accelerate:
- Lead Time
- Deployment Frequency
- Mean Time to Restore (MTTR)
- Change Fail Percentage
These four metrics allow us to measure our software delivery performance, and we used them as the foundation for improving developer experience. In particular, we wanted to ensure that developers could:
- Deploy a feature as fast as possible
- Deploy a feature as many times as possible within a day
- Restore as quickly as possible in case an error occurs
- Deliver features instead of fixing bugs
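To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record shape and the `delivery_metrics` function are hypothetical names chosen for illustration, not a description of our internal tooling. The sketch also assumes that time to restore is measured from the moment of deployment, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], days: int) -> dict[str, float]:
    """Compute the four delivery metrics over a window of `days` days."""
    if not deployments or days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    # Lead time: commit to production, in hours.
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    # Time to restore: assumed here to run from deployment to restoration, in minutes.
    restore_times = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": mean(lead_times),
        "deployments_per_day": len(deployments) / days,
        "mttr_minutes": mean(restore_times) if restore_times else 0.0,
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }


# Example: four deployments over two days, one of which failed.
base = datetime(2021, 11, 1, 9, 0)
sample = [
    Deployment(base, base + timedelta(hours=1)),
    Deployment(base, base + timedelta(hours=2)),
    Deployment(base, base + timedelta(hours=3), failed=True,
               restored_at=base + timedelta(hours=3, minutes=30)),
    Deployment(base, base + timedelta(hours=2)),
]
metrics = delivery_metrics(sample, days=2)
```

Each of the four goals in the list above maps to one key in the result: shorter lead time, higher deployment frequency, lower time to restore, and a lower change fail percentage.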
We structured our solutions to focus on the three aspects of development central to developer experience: developing, deploying and monitoring.
Developing
To improve the coding aspect of developer experience, we created processes to decrease the cognitive load on developers:
- Logging Infrastructure, so teams can log anything easily based on their logging needs. Before, every team had to monitor and maintain their own logging infrastructure manually.
- Caching with a sidecar pattern, so that teams get an effectively unlimited cache with just one line of code. Previously, every team built the same or similar features on different infrastructure, which meant each team had to do everything by themselves.
- Config & Secret Management, so that all config and secret information can be stored and accessed easily and securely. Before this, passwords were scattered across the codebase, which meant there could be security problems in the system that were hard to pinpoint.
- Rate Limiting, so that teams can monitor and limit the load on their systems. Before we implemented this, every team came up with their own solution, which was inefficient and time-consuming.
After these initial improvements, we also added SDKs and helper systems to ease the daily life of our team. This helped minimise the load on teams and developers and improved developer experience.
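As an illustration of the rate limiting item above, here is a minimal sketch of a token bucket, one common way to limit load. This is a generic technique shown under our own assumptions, not the implementation of the shared rate limiting product. The class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits in the budget."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would call `allow()` once per incoming request and reject or queue the request when it returns `False`. Providing this once as a shared component is what saves each team from writing its own variant.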
Deploying
The deployment process used to be manual, and adding new servers or new environments was hard for developers.
To improve this, we started using GitLab and Argo CD to deploy changes to production. Everything is ready out of the box, so developers do not need to reinvent the wheel. A team can then improve, change or tailor the pipelines based on their needs.
Because of the changes we made, our deployment frequency went from once a week to over 100 times per day. Developers were less frustrated because they could deploy faster and didn’t waste time on manual deployments.
Monitoring
Each team used to be responsible for manually monitoring their production environments. Scaling increased the need for monitoring, and developer experience deteriorated, because we could only find problems hours after they occurred. This meant developers spent more time finding and fixing bugs and less time developing and deploying.
We introduced tools like Kibana, Prometheus and Grafana, along with internally developed application performance monitoring tools, to automate monitoring. These gave us alerts, graphs and monitoring environments, which in turn enabled teams to focus on building and shipping features instead of spending time on monitoring.
All new servers created via scripts are included in production monitoring systems, and teams can create their own dashboards based on technical and business KPIs. So, a team can sleep peacefully knowing that the notification systems will inform them immediately if an error occurs.
Because of these changes, in 2019, we were able to find a mysterious problem that occurred during peak time. Using the tools mentioned above helped us identify the problem and check if we’d fixed it. We would not have been able to do this quickly and easily if we had not automated our monitoring.
Since implementing these monitoring changes, developer experience has greatly improved and stayed stable. Developers can scale their systems automatically and scale Kubernetes on the fly with a few clicks or based on events. In addition, databases scale easily without adding any load to a developer or a team.
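To show what metric-based scaling can look like, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. Whether a given team uses this autoscaler or another event-based mechanism is not covered here, so treat the function name, parameters and bounds as assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Replica count following ceil(current_replicas * current_metric / target_metric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (the bounds here are illustrative defaults).
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 pods running at 90% average CPU against a 60% target would be scaled to 6 pods, while 10 pods at 20% against the same target would be scaled down to 4.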
Where we are today
We have more than 80 000 pods running on Kubernetes and more than 1000 datasources for around 3 million concurrent users on a peak campaign day. You can check the actual numbers on our inframetrics page.
We now provide platforms for more than 150 domain teams. That’s massive growth and a huge undertaking in terms of developer experience. We’ve seen the efforts we put into improving developer experience pay off because the teams can focus more on domains, businesses and scaling than they could four years ago.
In the last four years of centring developer experience while scaling, we learned:
- Cache everything, as close to the client as possible. This meant we could handle more traffic, especially during peak times, and put less pressure on developers. Developers feel safer during peak times when less traffic reaches their systems. It’s easier to cache ahead of time than to handle the actual peak traffic.
- Be prepared for system failure: have backup plans for recovery, for reindexing data, and for when a system goes down. If you’re prepared, it should be quick and easy to set up another system. Infrastructure as code plays a major role here because it makes setting up a new system much faster and less error-prone.
- Minimise the cognitive load and external dependencies on teams, so that they can focus on business outcomes.
- Automate everything, to do things faster and with less human error. This helped us create datasources quickly and deploy them without human error. Automated tests also made sure we swiftly tested every production change. Anything that is not automated did, and will, cause problems.
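The caching lesson can be illustrated with a small read-through cache with expiry. This is a minimal in-process sketch under our own assumptions, not the sidecar cache described earlier. `TTLCache` and `get_or_load` are hypothetical names.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Read-through cache: serve a stored value until it expires, then reload it."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the backend is not called
        value = loader()     # cache miss or expired entry: call the backend once
        self._entries[key] = (now + self.ttl, value)
        return value
```

During a traffic peak, every hit is a request that never reaches the backing service, which is why a cache placed close to the client takes pressure off both the systems and the people running them.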
Future Improvements
We’re applying these developer experience learnings in our new verticals and new domains. We’re aiming to reduce the cognitive load on our infrastructure teams so they can focus more on the business and architecture. We expect to deliver these new projects faster and more robustly because of the time and effort we’ve spent on improving developer experience while scaling over the past four years.
For 2022, we have more international campaigns as we expand operations to 27 countries in Europe, so new challenges arise every week. We’re aiming to grow from 1300 to 2000 team members this year, and our main aim is to further improve developer experience on the mobile and frontend side.
New experiences and new challenges will arise while developing any new business. We believe that what we invest in developer experience increases productivity and team happiness.