As development teams grow, so does the number of new joiners encountering the same local development issues. Solving these issues manually can be time-consuming and expensive. That is why we at Stitch took the top 10 most common issues from new joiners and automated their detection and solutions - saving us time and money. Here is how we did it.
Common tooling issues were holding back our growing team
As we doubled in size over 2022, we had to onboard a large number of new developers. This growth exacerbated existing issues with our tooling and processes, such as version incompatibilities and unstable local development environments. We noticed the same issues coming up again and again as we onboarded new developers, and we needed our tools to meet the demands of our growing team.
We also had bad habits that existed as a result of our old tooling (like running our development commands with sudo). This issue alone introduced many permission issues across developers' file systems. Similarly, depending on when you joined, you were probably advised to use a different process for installing Node and pnpm (our Node package manager of choice). This caused many inconsistencies and made solving issues take longer than expected because of all the upfront debugging required.
We are also a distributed team. We have developers in Cape Town and Johannesburg in South Africa, Nigeria, Kenya, and the Netherlands. With our team being largely remote, the primary path for solving local development issues was to privately message someone you hoped had the answer.
Limitations of Slack
Initially, we set up Slack channels to try and alleviate the problem. Doing so worked fine, but it required:
- Anyone with an issue to ask for help (which can be hard).
- Someone who knows how to fix the issue to spend time troubleshooting (which can be expensive).
- Large amounts of back-and-forth, getting the person experiencing the issue to run a slew of commands to obtain the information required to debug things.
After a few months of these consolidated channels being the primary means for solving issues, we noticed a pattern in the problems: many of them shared root causes. We managed to remove most of them by adjusting the way we ran our tooling, but fixing the rest still required manual intervention. Around this time, we had introduced zx as a scripting language to replace our existing Bash scripts (as an aside, this has dramatically increased the number of people willing to write and maintain scripts).
Having identified the common issues, introduced a dramatically simpler scripting language, and realised that many of the issues we faced could be detected and solved automatically, we decided to do what we do best: write an automation script!
Scripting automated solutions saved us time and effort
Common issues that kept popping up
We had already collected the common issues in an attempt to remove the root causes for as many as we could (most of those fixes came down to disallowing running various tools with sudo). We took the remaining issues and spent time figuring out if they had anything in common. From those, we got a list of root causes, which we prioritised from most to least common. Here’s the list of checks we ended up with based on those root causes:
- Xcode installation validation
- Node version and installation
- pnpm version check
- macOS version check
- Touch ID enabled for sudo
- Check the user’s GITHUB_TOKEN is valid, can access our monorepo, and can access our remote packages
- Docker is installed correctly, is running, and has the correct secrets required to pull remote images
- Check all the file paths relevant to local development have the correct owners and file permissions
- Check Docker is at least the most recent recommended version
- Check that Docker has the correct settings for optimised performance
As you can see, there’s a lot of installation validation, file system checks, and config validation. All of these caused countless issues individually. The various issues were often hard to solve because the errors of the applications we ran would mask the true issues, like file permissions being incorrect or the token being used not having the correct scopes. These were the root causes for almost all of the local development issues we used to have. Solving them prevented a host of issues from ever being a problem again.
Creating a local-dev-check script
Traditionally, we had used Bash as the scripting language of choice. To enable more people to write and help maintain scripts, we moved to writing them in TypeScript using the zx library. When writing our local-dev-check script that would automate the solutions to our problems, we opted to use these new tools too, since the script needed to be easy to extend when people found other common issues in the future.
We use Plop to generate code for common things like microservices, scripts, and shared packages. Having templated code generation allows us to make sure we remain consistent with our setup and configuration. We have a template for our zx scripts that we used to scaffold the local-dev-check script.
$ pnpm plop
> stitch-root@1.0.0 plop /Users/jethro/stitch
> plop --plopfile ./plop/plopfile.mjs --dest ./
[PLOP] Please choose a generator. ts-bin-script - Create a new TypeScript script file
What is the script's name? local-dev-check
✔ ++ /scripts/local-dev-check.ts
✔ +- /package.json
✔ -> Linked stitch-root successfully!
After that, we could write code like this to do our checks and scripting:
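import { $ } from 'zx';
import { satisfies } from 'semver'; // assumed source of satisfies()

// printTitle, printSuccess, print.warning, and failingSteps are helpers
// defined elsewhere in the script.
async function checkNodeVersion(version: string) {
  printTitle('Checking Node Version');
  // Run `node -v` without echoing it, then strip whitespace and the leading "v".
  const { stdout: versionResult } = await $`node -v`.quiet();
  const nodeVersion = versionResult.trim().replace(/^v/, '');
  const correctNodeVersion = satisfies(nodeVersion, version);
  if (!correctNodeVersion) {
    // Record the failure for the final summary and tell the developer what to do.
    failingSteps.push(`Invalid node version - found '${nodeVersion}' - expected '${version}'`);
    print.warning(
      `⚠️ Incorrect node version - found '${nodeVersion}' - expected '${version}'\nPlease use Node ${version}`
    );
    return;
  }
  printSuccess(`Node version '${nodeVersion}' found!`);
}

await checkNodeVersion('18');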
We added helper methods and a consistent format to the script so that it would be as easy to extend as possible. Since adding the initial script, we’ve been adding checks to it as issues come up in Slack, so we only ever have to solve an issue once.
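As a rough illustration of what a consistent, easy-to-extend format could look like, here is a minimal sketch. It is not the actual script: `Check`, `CheckResult`, and `runChecks` are hypothetical names, and plain `console` calls stand in for the print helpers.

```typescript
// Hypothetical shapes for illustration; the real script's helpers may differ.
interface CheckResult {
  ok: boolean;
  message: string;
  fix?: string; // optional hint telling the developer how to resolve the failure
}

interface Check {
  title: string;
  run: () => Promise<CheckResult>;
}

// Runs every check in order and returns the messages of the ones that failed.
async function runChecks(checks: Check[]): Promise<string[]> {
  const failingSteps: string[] = [];
  for (const check of checks) {
    console.log(`\n== ${check.title} ==`);
    try {
      const result = await check.run();
      if (result.ok) {
        console.log(`OK: ${result.message}`);
      } else {
        failingSteps.push(result.message);
        console.warn(`WARNING: ${result.message}${result.fix ? `\n${result.fix}` : ''}`);
      }
    } catch (error) {
      // A check that throws is recorded as a failure so the remaining checks still run.
      failingSteps.push(`${check.title} could not run: ${String(error)}`);
    }
  }
  return failingSteps;
}
```

With a shape like this, adding a new check when an issue surfaces in Slack would only mean appending one more `{ title, run }` object to the list passed to `runChecks`.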
One of the benefits of using TypeScript is that some of the checks can be done in TypeScript itself instead of calling out to OS CLI tools, which makes them much easier to understand and debug. For example, the file permissions checks all use the native Node fs module to walk the various directories we want to check. We can do most of this work asynchronously, so it runs very quickly even though we’re running on Node via ts-node.
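A check along those lines might be sketched as follows. This is an assumption-laden illustration rather than the real implementation: `findWrongOwners` and `OwnershipIssue` are hypothetical names, and the sketch only checks file ownership, not permission bits.

```typescript
import { lstat, readdir } from 'node:fs/promises';
import { join } from 'node:path';

// Hypothetical result shape: a path and the uid that actually owns it.
interface OwnershipIssue {
  path: string;
  uid: number;
}

// Recursively walks `dir` and collects every entry not owned by `expectedUid`.
// Entries within a directory are processed concurrently via Promise.all.
async function findWrongOwners(dir: string, expectedUid: number): Promise<OwnershipIssue[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const results = await Promise.all(
    entries.map(async (entry): Promise<OwnershipIssue[]> => {
      const fullPath = join(dir, entry.name);
      // lstat does not follow symlinks, so a link is judged by its own owner.
      const stats = await lstat(fullPath);
      const own: OwnershipIssue[] =
        stats.uid === expectedUid ? [] : [{ path: fullPath, uid: stats.uid }];
      const nested = entry.isDirectory() ? await findWrongOwners(fullPath, expectedUid) : [];
      return own.concat(nested);
    })
  );
  // Flatten the per-entry lists into a single list of issues.
  return ([] as OwnershipIssue[]).concat(...results);
}
```

Called with a project directory and the current user's uid, this would return the files left behind by earlier sudo runs, such as anything owned by root (uid 0).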
The benefits of automated scripting
Validate local development environments
Most of the developers who ran the script initially found some issue with their local development setup that they didn’t even realise was there. For example, many people found files that were owned by the root user because of how we had to run other unrelated developer tooling in the past. Fixing that often fixed seemingly unrelated issues.
Having a consistent, clean development environment makes debugging any other issues far easier, which also helped everyone in Slack when debugging specific local development issues in the future. Every time someone has an issue, we get them to run the local-dev-check script as a first step both to get the information it outputs and to make sure we’re starting from some baseline level of “correctness”.
We’ve also gained the ability to check key areas of people’s systems to make sure there aren’t any unintentional vulnerabilities. For example, we can flag GITHUB_TOKENs that have no expiry date set or that carry additional, unnecessary scopes.
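One way such a token audit could work is to inspect the headers GitHub returns on an authenticated REST API response. The sketch below assumes a classic personal access token, for which GitHub sends `x-oauth-scopes` and, when an expiry is set, `github-authentication-token-expiration`. The function name `auditTokenHeaders` and the allowed scope list are hypothetical, not taken from the real script.

```typescript
// Inspects response headers from an authenticated GitHub API call and returns
// a list of warnings. Header names are lower-cased, as Node and fetch expose them.
function auditTokenHeaders(
  headers: Record<string, string | undefined>,
  allowedScopes: string[]
): string[] {
  const warnings: string[] = [];

  // The expiration header is only present when the token has an expiry date.
  if (!headers['github-authentication-token-expiration']) {
    warnings.push('Token has no expiry date set');
  }

  // x-oauth-scopes is a comma-separated list such as "repo, read:packages".
  const scopes = (headers['x-oauth-scopes'] || '')
    .split(',')
    .map((scope) => scope.trim())
    .filter((scope) => scope.length > 0);
  const extraScopes = scopes.filter((scope) => allowedScopes.indexOf(scope) === -1);
  if (extraScopes.length > 0) {
    warnings.push(`Token has unnecessary scopes: ${extraScopes.join(', ')}`);
  }

  return warnings;
}

// Example: the allowed scopes here are an assumption for illustration only.
const tokenWarnings = auditTokenHeaders(
  { 'x-oauth-scopes': 'repo, read:packages, delete_repo' },
  ['repo', 'read:packages']
);
console.log(tokenWarnings);
```

Keeping the audit as a pure function over headers keeps it separate from the network call, so it can be exercised without a real token.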
Consistent tooling updates
The script has allowed us to add recommendations for various tooling versions like Node, pnpm, Docker, and macOS. We can trial new versions out, and once we’ve confirmed they’re safe, we add them as the latest minimum version in the local-dev-check so that when people run it in the future, it prompts them to upgrade.
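A minimum-version prompt of this kind can be sketched with a small comparison helper. The following is a self-contained illustration under stated assumptions: `meetsMinimum` and `parseVersion` are hypothetical names, the version numbers are made up, and pre-release tags are ignored. A dedicated library such as semver would be the more robust choice in a real script.

```typescript
// Turns "v18.12.1" into [18, 12, 1]; non-numeric parts are treated as 0.
function parseVersion(version: string): number[] {
  return version
    .trim()
    .replace(/^v/, '')
    .split('.')
    .map((part) => parseInt(part, 10) || 0);
}

// Returns true when `found` is greater than or equal to `minimum`,
// comparing each dotted part numerically (so 4.10 is newer than 4.9).
function meetsMinimum(found: string, minimum: string): boolean {
  const foundParts = parseVersion(found);
  const minimumParts = parseVersion(minimum);
  const length = Math.max(foundParts.length, minimumParts.length);
  for (let i = 0; i < length; i++) {
    const a = foundParts[i] || 0;
    const b = minimumParts[i] || 0;
    if (a !== b) {
      return a > b;
    }
  }
  return true; // all parts equal
}

// Made-up example values: a developer on Node 16 is prompted to upgrade.
if (!meetsMinimum('v16.20.0', '18')) {
  console.warn('Node is below the recommended minimum version, please upgrade');
}
```

Raising the recommended minimum after a successful trial would then be a one-line change to the version string passed to the check.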
The benefit of consistent, trusted tooling updates is harder to quantify than the time saved on debugging, but each update we adopt generally gives us access to more useful features and helps us move faster. A good example of this is Docker Desktop for Mac, which has introduced a slew of virtualization, file-system, and build improvements over the last year that have dramatically reduced the local build times of our development Docker images.
Saves us time for more productive work
All of these things have the same core benefit for the team and Stitch as a whole: developers get to save time. Before the script, we would spend around 20 minutes a day on average debugging issues.
Twenty minutes a day over a working month of 22 days is a total of 440 minutes (or 7.3 hours). And that’s just the time spent helping new joiners debug their issues. Developers will often try to solve the issue themselves first before asking for help, so the total amount of time spent on an issue is often much higher. These time savings are mostly ongoing and generally scale with the size of the organisation. Now we can spend those 7.3+ hours on more productive work and on solving more important issues.
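The arithmetic behind that estimate is simple enough to reuse for other recurring tasks. The helper below is a hypothetical sketch: only the 20 minutes and 22 days come from the figures above.

```typescript
// Estimated hours per month spent on a recurring activity.
function monthlyHours(minutesPerDay: number, workingDaysPerMonth: number): number {
  return (minutesPerDay * workingDaysPerMonth) / 60;
}

const hours = monthlyHours(20, 22); // 20 * 22 = 440 minutes
console.log(`${hours.toFixed(1)} hours per month`); // prints "7.3 hours per month"
```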
Improves confidence and morale
It’s hard, especially as a new person, to constantly ask for help with things you think should be easy to solve. This is made worse when running into file or token permission issues, which often have opaque or hard-to-decipher error messages.
By automating away these issues, everyone has easy access to a quick fix for most issues, freeing them from the mental strain of solving a common issue and giving them confidence. The morale and mental benefits of having an automated way to solve your issues can’t be overstated.
Conclusion
As we continue our growth, dealing with these issues proactively is essential to make sure our engineering team is as efficient as possible. After we’ve encountered a preventable issue once, having anyone else needlessly struggle with the same issue again is a waste when we have the tools to prevent it.
There are opportunities for automation in every business. Finding those and taking advantage of them could provide a massive boost to the number of productive hours each person has a month. Even small automations can have a large impact over the lifetime of a business. You won’t know how much you could save until you quantify how much time is spent, how often it’s spent, and by how many people.
Jethro Muller is a senior full stack developer at Stitch. He primarily works in TypeScript on server-side Node.js code. He enjoys ORM query optimisation, building pipelines and tooling, optimising workflows, and playing with AI.