Reposurgeon, for high fidelity source control system migration

The best time to blog about something is when it happens. The second best time is when you remember years later that you should have blogged about it. That’s now.

I’ve worked on complex source control system migrations, moving between various systems, most commonly SVN to Git. There are hundreds or thousands of tools and scripts around the Internet suggested for every plausible migration pair, and almost none of them even attempt to solve the whole problem.

The closest I have seen to solving the whole problem is reposurgeon:

reposurgeon’s web page

reposurgeon source code on GitLab

The strength of this tool is that it is intended to be scripted. Rather than a single-shot conversion, the workflow is:

  1. Attempt the conversion
  2. Study the results
  3. Tweak the conversion script (which can perform extensive and complex changes to the source code history on its way through)
  4. Repeat until approximately perfect
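
A conversion script (reposurgeon calls these lift scripts) is just a file of reposurgeon commands, re-run from scratch on each pass. The sketch below is roughly what a minimal SVN-to-Git one looks like; the file names are invented, and the exact command spellings should be checked against the reposurgeon manual for your version:

```
# project.lift -- hypothetical lift script, re-run on every pass
# Read a Subversion dump of the old repository
read <project.svn
# Target Git for the output
prefer git
# Map SVN usernames to Git identities (author-map format shown later in this post)
authors read <project.map
# ...repository-specific surgery accumulates here as you study the results...
# Emit a Git fast-import stream for `git fast-import`
write >project.fi
```

The four steps above then become a loop: run the script, inspect the resulting Git history, add another fix-up command, run it again.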

Teams using the old system continue doing so while the migration is worked on. Only once the migration has been perfected is it time to cut over.

By scripting, I don’t mean that you, the user, must write scripts to do the basics of source control system history migration; that is the job of the tool. Rather, you script to patch up the ugly bits of history in the old system during the translation. For example, in a moment of desperation, did somebody once merge a giant change to the mainline, something like rolling back the last three months of development, to try to get a deployable old build? That’s an easy bit of the old history to leave out during a reposurgeon-powered migration.
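
A sketch of what leaving that out might look like, with an invented revision number and command spellings recalled from the reposurgeon manual rather than guaranteed (newer versions may express this as a squash with a delete policy):

```
# Hypothetical: SVN r2317 was the desperate "roll back three months"
# merge, and r2318 reverted it. Select both commits by their legacy
# Subversion revision IDs and drop them from the converted history.
<2317>,<2318> delete
```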

During migration you can translate usernames, branch names, details buried inside commit messages, and any other aspect you might wish to clean up programmatically. Think of it as an analog of git filter-branch that works across source control systems, except it covers an entire repository, not just a branch.
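
As a concrete example, the username translation is driven by a small author-map file fed to the authors read command shown earlier. The format below is how I remember a hypothetical project.map looking, with made-up names: one mapping per line, old SVN username on the left, Git identity on the right.

```
alice = Alice Example <alice@example.com>
bdavis = Bob Davis <bob.davis@example.com>
```

Branch and tag renames, and regex rewrites of commit-message text (reposurgeon’s filter command, if memory serves), are similar one-liners in the lift script.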

One major downside, as of the last time I used reposurgeon: it operates entirely in memory, so you’ll need enough RAM to hold the whole source code history. This can typically be accommodated, even on quite large code bases, by temporarily renting an extremely large compute instance from your cloud provider.