Cleanin’ files and git’in wild — migrating a large repository into a new monorepo

Yikes. How on earth did we manage that?

Aislinn Hayes
HMH Engineering

--

The ancient monoliths at Stonehenge (not to be confused with the new monorepo at HMH)

Don’t ask me how, but despite my relative newness to HMH, I got roped into an interesting problem that kept me quite busy for the last couple of months. We needed to migrate one of our existing code repositories*, used in our learning platform Ed, into a new monorepo (cue Aislinn frantically googling “monorepo”). While I’m not going to go into the details of the monorepo concept here, there are plenty of articles online about the pros and cons. In any case, I’m not complaining — one of my favourite things about being an engineer is learning new things every day and getting paid for it!

Person typing on Macbook computer

So, back to the problem at hand. Everyone in our UI team was depending on this process to go smoothly so that we could tackle our next phase of work — no pressure. It was also something we had never done before in the company, but it's likely something we'll need to do again in the future.

Let’s break it down and see how it went. For the purposes of this discussion (but mostly for my own amusement), I’ll call the repository to be moved Cria and the monorepo Alpaca.

Photo of a llama, or possibly an alpaca…

Step 1: Define the problem

Immediately it was apparent that this task was not going to be a simple “lift and shift”. Cria was a bit of a monorepo in its own right, with its own webpack configuration, Jenkins/Docker files, linting settings, SonarQube properties, Storybook — the whole shebang. It was also massive in size (nearly 3 GB) due to some legacy generated files that still existed in the git history, no bueno.

Don’t do it alone!

After I had a quick googlin’ and became familiar with the monorepo, I gathered a few members of the UI engineering braintrust and we collectively decided what exactly we wanted to achieve during the migration:

  • Preserve the git history of Cria when merging it into the monorepo — we didn’t want to lose precious context from the commit history that Git had stored up for us over the last few years.
  • Reduce the size of Cria to avoid bloating the codebase of the monorepo.
  • Refactor the code structure and its builds so that Cria would be just another app in the monorepo.

Thanks to the help from my teammates, I now had a clear goal — so far, so good.

Step 2: Identify the steps to solve the problem

It's worth mentioning that JIRA was my best friend from this point on. Writing out the goal and my proposed solution helped me focus my thoughts and consider the wider impact on the rest of the team, and it also allowed my peers to review my thought process at any stage and offer advice. By the end of this process, the JIRA ticket I was using was littered with subtasks, linked dependencies, and lots of comments — which is proving to be super helpful now in writing this blog post!

After a lot of chin-scratching and thoughtful tea-drinking, I came up with the following plan:

1) Ensure that Alpaca is ready for new developers

Thankfully, there was a dedicated team looking after the monorepo, so I didn’t have to do much on this step. I just put on my QA hat and made sure I could get Alpaca running locally, and the team got the ball rolling on getting the docs up to date for everyone to start working on it.

2) Clean up the git history in Cria

The first thing we noticed here was that we would need a complete code freeze on the Cria repository, to prevent anyone from merging in a branch that still had references to the files we were trying to delete.

As we knew the productivity impact of a code freeze would be significant, everything we suggested for this plan was tested separately on copies of repositories and build jobs, timed to the hour (although estimated to the half day in classic cautious engineer fashion), and communicated upfront.

We’ll get to that part a little later.

I made a list of the generated files that we had checked into the repo in the early days of getting our strangler pattern up and running for Ed (these were the files bloating the size of the git history), and validated the list with the braintrust. I found some help online about purging large files from the git history of a repository, and I ended up testing 3 methods for doing so, with the following results:

A clear winner there! Some pain points and learnings:

  • I wish that I had started with BFG, and not with the filter-branch method — git filter-branch is super powerful and will allow you to do things BFG can't, but if you're just doing a straight removal of files, BFG is much faster.
  • The --aggressive flag on git gc was fine to leave out in this case, as I confirmed with the original developer of the BFG tool on Twitter (Twitter and Google, where would we be without them).
  • Closing all the PRs and deleting as many branches as possible from Cria sped this up during the actual migration (merging the two repositories, as described a couple of steps below this, doesn’t take any other branches or issues with you — just the master branch and its history).
  • I used a forked copy of Cria in the first couple of tests, but I was seeing strange results on the size of the repository on forked copies — after removing the files, the size wasn’t going down. That’s why I switched to using separate duplicated copies during the tests, but it didn’t help much. No matter what I did, I was seeing that the size of the cleansed repo wasn’t going down, but when I merged it with a copy of the monorepo, the monorepo size was only going up marginally. It didn’t make any sense!
    In actual fact, the problem was the branches. During the real migration, once I had closed the PRs and deleted as many branches as we could, the size of the repository after cleaning the .git history went way down from ~2 GB to ~70 MB. Phew.
    Just to add to the confusion, when I say the size went up or down — sometimes it took a couple of hours or days to see the size change. Some information that I found here and here includes off-hand references to caching on GitHub and GitLab — our suspicion is that it can take a while for some sort of garbage collection activity to take place after you run git gc. (If you want to check locally without waiting, there's a quick sketch just after this list.)
  • I managed to get my hands on an extra laptop to run the tests concurrently on multiple copies of Cria, which helped speed up the testing process. After a couple of failed executions, I also learned to ensure the laptop would never sleep during my tests by updating my Mac Energy settings:
A picture of the settings required for a Mac not to sleep — the environmentalist in me is still cringing at this.
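
If you want to sanity-check the size locally rather than waiting on GitHub to catch up, a few plain git commands inside the mirror clone will do it. This is just a sketch, not part of the official steps below:

$ git reflog expire --expire=now --all   # drop old reflog entries that keep purged objects alive
$ git gc --prune=now                     # garbage-collect locally so the rewrite actually shrinks .git
$ git count-objects -vH                  # "size-pack" shows the packed size of the repository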

For reference, the full instructions for the chosen approach are as follows:

  • Make a mirror clone of the repository:
    $ git clone --mirror <GIT_REPO>
  • Download the BFG tool into the root of that local repository as “bfg.jar”.
  • Ensure Energy Settings on laptop are set to never sleep.
  • Download the bash script “remove-generated-files.sh” as detailed below into the root of the local repository.
  • Edit the references to the directories, the files to be deleted, and .git files (all marked by `<>`), and run it.
  • Time how long it takes by looking at the date outputs.
  • Look for a specific commit in the history of the repository and check if the files are gone.
  • Check the size of the repository after a couple of hours or even days (to allow for GitHub garbage collection, as mentioned above).
Code snippet we used to purge files from a git repository: https://gist.github.com/ahayes91/454c0f84d0af377cd273cbc56ca0cf51

Our test for success was to navigate to a particular commit in the history where a file was present on the original Cria (I had to do this via the GitHub UI, because running these commands inherently changes all the commit SHAs, so I couldn't link to a commit URL directly) and check that the file was no longer present.

Screenshot of a commit prior to purging the files from the history — the commit shows 7 changed files.
Screenshot of the same commit after purging the files from the history — the commit now shows only 1 changed file.
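
If you'd rather check from the command line than through the GitHub UI, one quick spot-check is to ask git whether any commit still touches one of the purged paths (the path here is just a placeholder):

$ git log --all --oneline -- <PATH/TO/GENERATED_FILE>   # should print nothing once the file is gone from every commit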

3) Move Cria into the intended app directory pattern, and get the current builds working with this new pattern

This boiled down to a couple of sub-steps:

  • Figure out with the Alpaca team what Cria should look like in the monorepo (there's a rough sketch of the kind of layout we were aiming for just after this list).
  • Minimise the impact of (or, ideally, avoid altogether) any Cria-specific configuration at the root of the monorepo (e.g. unit test/jest configuration, linting settings, hooks, package dependencies).
  • Do a full analysis of the existing builds and how they should work alongside the Alpaca builds.
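
For illustration, the kind of layout we were aiming for looked roughly like this (the directory and file names below are made up, and the real structure differed in the details):

alpaca/
├── apps/
│   ├── cria/            # the migrated app, keeping its own webpack/jest/Storybook config
│   └── <other-apps>/
├── package.json         # root-level dependencies and scripts
└── .eslintrc.json       # root-level linting settings shared across apps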

Once I had that information, I did the updates on a copy of Cria and made copies of our builds in Jenkins that I could reconfigure and test myself. I removed any steps that actually deployed anything from the Jenkins/Dockerfiles, and then it was a process of incremental change and build-kicking until I had them running successfully.

4) Merge the two repositories

This one was actually much easier than I thought it would be — a handy article here explained exactly how to do this:

# run from the root of the monorepo (Alpaca)
$ git remote add -f REPO <REPO>
$ git merge REPO/master --allow-unrelated-histories
$ git push

A couple of tests on my copies of the repositories showed the git histories successfully merged. If I remember correctly, the real run of this step took less than an hour.

A couple of things to note here:

  • Make sure all PRs are closed and as many branches are deleted as possible from the repo you’re trying to migrate — it will speed things up.
  • Your PRs, issues, and branches (except master) won’t be brought with you to the new repository — only the commits on the master branch will be merged in.
  • When I did this step during the actual migration, we implemented a code freeze on both repositories and removed any branch restrictions so that I could do this merging step directly on Alpaca's master branch. It worked for us in the interest of saving time (and also knowing that we had to get it working no matter what), but ideally you would do the merging on a separate branch and submit a PR (if it works); there's a rough sketch of that branch-based approach just after this list. If we ever get a chance to test it out I'll come back and update this post!
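
For what it's worth, a minimal sketch of that branch-based approach might look like this (the branch and remote names are made up):

# in a local clone of the monorepo (Alpaca)
$ git checkout -b cria-migration
$ git remote add -f cria <CRIA_REPO_URL>
$ git merge cria/master --allow-unrelated-histories
$ git push -u origin cria-migration
# then open a PR from cria-migration into master and review it as usual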

5) Make the old repository read-only

We use GitHub Enterprise here at HMH, and Terraform for administration of repositories — plenty of new jargon for me, so I ferreted out one of our experts, established what we'd need to do to configure the old repository, created a PR, and made sure the admin would be available to flip the switch when the time came.

6) Make sure all commands still work locally

This part was the trickiest, both during testing and during the real migration. We needed to get all the unit tests, the linting commands, the local builds, and our end-to-end automated tests running locally.

There were lots of jest configuration changes, webpack and babel tweaks, and linting changes required here. It's hard to go into detail without giving away the whole coding farm, but here are some highlights:

  • Our monorepo, as a relatively new codebase, had more stringent linting rules than our legacy Cria repository. This meant we had to turn off or switch a lot of rules to “warn” at the root .eslintrc.json “overrides” level (a sketch of what such an override might look like follows this list), and we also had to raise stories in our backlogs to tackle this technical debt across the teams.
  • Same for our prettier settings — for example, it seems like a silly thing, but the root-level monorepo settings require double quotes, whereas Cria requires single quotes, and using double quotes in the legacy Cria code broke it somewhere. In the interest of saving time during the migration process, this was also chalked up as technical debt to be properly investigated and solved later.
  • It took us a while to get the jest / babel / webpack settings completely correct — the only advice I can really give here is to experiment with package versions, both fixed and using caret (^).
  • Different React versions and different Material UI versions in particular cause a lot of hassle in a monorepo — if you can manage it, try to streamline the versions of these between the apps.
  • Once you’ve got all the commands working, make sure you update the README.md files both at the root level and at the app level, otherwise, your fellow engineers will find it trickier to get started!
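
To give a flavour of the first point above, a root-level override that relaxes rules for the migrated app might look something like this (the glob and the specific rules are placeholders, not the ones we actually changed):

$ cat .eslintrc.json
{
  "overrides": [
    {
      "files": ["apps/cria/**/*.js"],
      "rules": {
        "no-unused-vars": "warn",
        "prefer-const": "warn"
      }
    }
  ]
}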

7) Point the builds to Alpaca and ensure they still work

Again, I tested this step on my copies of the builds — we had to make some changes to ensure they had the right access keys and tokens, plus a few configuration changes based on the new file directories. I did a few actual deployments of our Docker images to Artifactory with some dummy tags, but nothing to our environments until the real migration.

8) Communicate the approach and time-frame to stakeholders and agree a kick-off date

Once I had a full run-through of all the previous steps on my test repositories and build jobs, I was able to create a detailed step-by-step plan (on JIRA) with potential owners and timelines, with a structure like this:

After I was confident in the plan, I set up a meeting with a few technical leads, product owners, and delivery leads to ensure all our stakeholders in this process were happy with it, and to identify any risks and how we'd mitigate them.

One simple thing that came out of this meeting was the idea that, when doing the actual migration, we would replicate the Jenkins jobs, rather than updating the old ones. This was to avoid losing anything important in the configuration in case we had any bugs or issues that we needed to fix urgently. Jenkins, as you may or may not know, has no change control or change history — always annoying and in this case, quite risky!

At this point, I was able to relax and wait for the green light from management to begin the process — we also posted a lot of updates to Slack during this period, explaining the work that was about to happen, what people could expect to see in the migration, and how to get back up and running when it was completed. Given that we have teams in multiple locations around the world, this exercise in over-communication was very important.

9) Implement the move, test it, deploy to production

After all that testing? Easy peasy!

10) Identify and prioritise any knock-on clean-up tasks

As I mentioned already, we had a lot of linting errors, some doc server updates, and a few other bits and pieces left to clean up that could be done in parallel with everyone getting back to work after the code freeze. We had to document these as stories in JIRA and nudge our engineering leads to make sure this tech debt would be tackled, pronto.

Step 3: Solve the problem!

I cheated a bit and included the results of what actually happened in the steps above, but this section is a nice place to recap:

  • 15 working days of planning
  • 5 working days of a merge freeze for the whole team
  • 7 working days overall to make it from initial merge freeze to production

And after all that, Aislinn had an iron-clad excuse to take a holiday to Rome!

The author looking very touristy in the Colosseum in Rome and not thinking about git at all!

*If you want some more background on how we ended up with multiple repositories in the first place, check out my pal Clíona’s post on strangler applications.

Are you patient, or in this case, stubbornly unrelenting in the face of new problems? HMH would love to have someone like you on the team — come join us!
