I recently ran into an interesting VCS challenge: combining two separate SVN repositories into one while preserving the relevant part of the history. At first sight the task seemed straightforward, but it turned out to be quite a hairy deal. As usual, a software project cannot stand still for long, so active development continued during the whole transition. I will describe some of my findings in the hope that the next person trying something similar does not have to bump into all the same obstacles.
But first, a small warning. This post will be longer and more technical than the usual Futurice blog post, so unless you feel like taking a deep dive into the mysteries of VCS, you should stop reading now. Consider yourself warned.
Here are a few Do's and Don'ts for the impatient. I will then discuss the process in more detail below.
The two repositories that needed to be combined were a server repository and a separate client repository. Managing the dependencies and release cycles between the two had turned out to be too cumbersome for a fast paced, agile project, and it was slowing us down unnecessarily. The complication in the task was that the client repository had originally started as a branch of a larger codebase. The original small branch turned into a larger project, and soon matured to be a fully independent project, without any proper relation to the original parent project.
SVN actually provides tools for combining several repositories. Running svnadmin dump on one repository and svnadmin load on the other preserves the full history of both. The process does not, however, cope with any path conflicts. I don't consider myself an SVN expert, and after consulting some experts on the topic, we decided that this was way too error prone and cumbersome to be done in practice. For full repository dumps the approach might work, but since we wanted only the relevant parts of the history, i.e. one branch, it was not suitable. One of the repositories contained approximately 25000 revisions and the other well over 3000, so the overhead of the extra changesets and the probability of multiple path conflicts was too big a risk to take.
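For reference, the dump-and-load route could look roughly like the sketch below. It assumes shell access to both repositories on disk, and the function name, paths and target directory are hypothetical. Note that svndumpfilter is known to have trouble when an included path was originally copied from an excluded one, which is exactly the situation of a branch that grew out of a larger codebase.

```shell
#!/bin/bash
# Sketch only: carry one branch of a source repository into a target repository.
# Nothing runs until the function is called; all arguments are placeholders.
merge_branch_via_dump() {
    local src_repo=$1      # filesystem path of the source repository
    local branch_path=$2   # path of the branch inside it, e.g. branches/client
    local dst_repo=$3      # filesystem path of the target repository
    local parent_dir=$4    # directory in the target that receives the history
                           # (assumed to exist already)

    # Dump everything, keep only revisions touching the branch, and load the
    # result under parent_dir so that paths do not collide.
    svnadmin dump --quiet "$src_repo" |
        svndumpfilter include --drop-empty-revs --renumber-revs "$branch_path" |
        svnadmin load --parent-dir "$parent_dir" "$dst_repo"
}
```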
I have been using git-svn successfully as an SVN client for well over a year now, so the natural next approach was converting the repositories to git and replaying the changes back to one of the SVN repositories. This approach could also be scripted in SVN-only fashion, but having the full history in git would make the process much faster. Git-svn allows for extracting specific branches into the repository, which was one of the original goals of the migration. The problem with replaying changes was the hosted SVN server we were using. We did not have shell access to it, making scripting much harder. The SVN server automatically assigns the unix username to the commit, so had I just replayed the changes, the history would have shown me as the only collaborator. As a compromise, we agreed that we could instead include the original commit data in the new log message.
Svn blame would be broken, but at least it would be possible to find the original author in the log. Git has the very handy filter-branch command, which lets the user make batch changes to the history. Changing shared history is normally not a recommended approach, but in this case we saw it as the only option to keep SVN as a viable solution. Git-svn adds some metadata to the commits, so they end up looking like this:
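A minimal sketch of cloning just one branch with git-svn, assuming the branch lives under a single path in the SVN repository. The function name, URL and paths are hypothetical, and the command needs git-svn and network access to the SVN server.

```shell
#!/bin/bash
# Sketch only: import the history of a single SVN path into a new git repository.
clone_svn_branch() {
    local svn_url=$1       # repository root, e.g. svn+ssh://svn.example.com/project
    local branch_path=$2   # path of the branch, e.g. branches/client
    local target_dir=$3    # directory for the new git repository

    # Pointing git svn at the branch path keeps unrelated branches out of the clone.
    git svn clone "$svn_url/$branch_path" "$target_dir"
}
```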
    commit fedcbaf1821677f02c1c36adb99f5ed3d5a5728d
    Author: unix-user
    Date:   Fri Jan 14 16:36:46 2010 +0000

        LOG MESSAGE

        git-svn-id: svn+ssh://SVN-SERVER-AND-REPOSITORY@22222 11e32337-3d48-0410-9457-8905d971a4a1
The git-svn-id contains the original repository information, as well as the revision number of the SVN commit, i.e. 22222 in the example above. With a little bash+perl magic below, the log message could be made much more readable.
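    #!/bin/bash
    # Replace the git-svn-id trailer with the SVN revision number and append
    # the original author and timestamp to every commit message.
    git filter-branch --msg-filter '
        perl -pe "s/^\s*git-svn-id: [^@]+@(\d+) .*\$/Revision:\t\$1/" &&
        printf "Author:\t%s\n" "$GIT_AUTHOR_NAME" &&
        perl -MPOSIX=strftime -e "
            # GIT_AUTHOR_DATE is <epoch> <timezone>, possibly with a leading @
            (\$epoch) = \$ENV{GIT_AUTHOR_DATE} =~ /(\d+)/;
            print strftime(\"Timestamp:\t%Y-%m-%dT%H:%M:%S\n\", localtime(\$epoch));
        "
    ' "$@"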
The operation takes several hours to run, even on my SSD equipped machine, but considering the one-time nature of the operation, this was acceptable. Filter-branch is also not restricted to changing log messages. It allows almost anything to be done to the history. For example, we played with the idea of moving the whole history of one of the projects to a suitable subdirectory before the merging, but found the approach too intrusive. It might compromise our ability to recreate earlier builds at a later time. However, figuring out the power of filter-branch made this trial and error a very educational one.
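For illustration, moving a whole history into a subdirectory could be sketched with a tree filter as below. This is not necessarily the command we experimented with. The function name is hypothetical, the argument is assumed to be a single top-level directory name, and a tree filter is slow on large histories.

```shell
#!/bin/bash
# Sketch only: rewrite every commit reachable from HEAD so that all files
# live under one top-level directory.
move_history_into_subdir() {
    local subdir=$1   # e.g. client (a plain directory name, no slashes)

    # The filter runs once per commit inside a temporary checkout of that commit.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --tree-filter "
        mkdir -p '$subdir' &&
        find . -mindepth 1 -maxdepth 1 ! -name '$subdir' -exec mv {} '$subdir'/ \;
    " HEAD
}
```

Since this rewrites every commit id, it is best tried on a throwaway clone first.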
The results of rewriting the history were very promising. The flexibility of git, however, impressed the development team, and finally we decided to switch over completely to the world of DVCS. This way we could preserve the full history with the full commit information without having to rewrite log messages. We pondered over rewriting the history to reflect the new structure, but again rejected the idea for the same reason as before.
The task of combining two or more git repositories is easy. It is possible to pull in the changes from any unrelated repositories. The initial cloning of the SVN repositories is quite straightforward. Co-operating with svn, however, makes synchronizing the repositories between git and SVN quite a big problem.
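Pulling the history of an unrelated repository into an existing one can be sketched as follows. The function name, remote name and location are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch only: make the history of another git repository available locally.
import_history() {
    local remote_name=$1   # e.g. client
    local location=$2      # path or URL of the other git repository

    git remote add "$remote_name" "$location" &&
    git fetch --quiet "$remote_name"
}
```

After the fetch, the other repository's branches are visible as remote-tracking branches under the chosen remote name and can be inspected side by side with the local history before anything is merged.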
Git-svn uses a rebase strategy to synchronize the local repository with the SVN repository. Rebasing keeps the history linear by appending any changes done locally in git to the end of the chain of SVN changesets. This is not a problem as long as there are changes only on either the git or the SVN side, but maintaining changes in two systems becomes very work intensive after a while. I unfortunately made the mistake of creating the authoritative git repository at this point, contrary to the recommendation in the git-svn manual. At this point I also moved a lot of directories around in git, laying out the directory and project structure. The changes could not be committed back to SVN, thus laying out a stony path of synchronization work for myself. The transition took about two weeks, so there were a lot of changesets created on the SVN side, all of which had to be synchronized to git.
In order not to create discrepancies in the now authoritative git repository, it was no longer possible to rebase the handful of changesets with the new structure on the SVN changes, but instead I had to rebase the SVN changes on top of the git changes. The local repository in git makes rebasing easy, as even this complex operation can easily be reverted and started over. The git reflog command is of tremendous help in the process, keeping a log of all the changes being made. Unfortunately, my changes in git introduced some interesting error cases for the rebasing.
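A minimal sketch of the safety net, assuming the rebase has just finished and nothing else has moved the branch since. The function names are hypothetical.

```shell
#!/bin/bash
# Show the most recent positions of HEAD, newest first.
show_recent_positions() {
    git reflog -n "${1:-10}"
}

# Put the current branch back where it pointed before the last rebase.
# git records that position in ORIG_HEAD when the rebase starts.
undo_last_rebase() {
    git reset --hard ORIG_HEAD
}
```

If other operations have run in between, the reflog listing still shows the earlier positions, and any of them can be given to git reset --hard instead of ORIG_HEAD.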
Git only tracks file contents, not directories, and tries to deduce which files have moved from one place to another. SVN, on the other hand, treats both directories and files as first class objects, and only stores the file content as a property of the file object. This distinction is very important when making git and SVN co-operate in the way that we were doing here. Git managed to do the rebase and deduce the new location of any old files that had been moved, and was thus able to apply the changes to the correct files. This, however, does not work for new files that were created in SVN after branching to git. All the new files were left at their original path, relative to the root of the repository. The fix was to manually move them into the new structure. Fortunately there were no complex renames of files in SVN, as these would have made synchronizing significantly harder.
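Git's rename deduction can be observed directly. The sketch below lists the files git considers renamed between two commits; the function name is hypothetical.

```shell
#!/bin/bash
# List files that git detects as renamed between two commits.
# Each output line has the form: R<similarity><TAB><old path><TAB><new path>
list_renames() {
    local from=$1 to=$2
    git diff -M --name-status --diff-filter=R "$from" "$to"
}
```

A file that exists only on the SVN side after the move has no counterpart to match against, which is consistent with the new files staying at their old paths.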
Having noticed the problem during the first repository migration made me change the strategy for the second one. Instead of just rebasing the changes, I decided to first revert any changes to the directory structure, in order to reduce confusion over new files in the repository. Revert is a safe, non-destructive operation as it creates new reverse changesets at the end of the history, thus keeping the history intact. Having reverted the directory renames, I was quite easily able to create a clean and linear history by rebasing all new SVN changes to git. This new approach saved me several hours of work of tracking down any possible missing files or conflicts due to git being unable to locate all changed files.
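Reverting the restructuring commits can be sketched as below, assuming their ids are known. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch only: undo restructuring commits by adding reverse commits on top.
# Accepts commit ids or a range; existing history is left untouched.
revert_restructuring() {
    git revert --no-edit "$@"
}
```

For example, revert_restructuring HEAD undoes only the latest commit. With the old layout restored, new SVN changesets apply to the paths they were written against.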
One final problem we had was a handful of late changes in the SVN repository. Once the rebase has been done, there is really no proper way to repeat the process. All the changeset ids change due to the rebase, so trying to do the same rebase a second time is bound to fail. Fortunately git supports cherry-picking, i.e. copying individual changesets from one branch to another. It was quite easy to go through the missing changesets and cherry-pick them one by one to the new branch. Cherry-pick supports copying multiple changesets with a single command, but since there were only a handful of changes, I did it manually. This is the process that will have to be followed should there be any changes to the SVN repository later on. Because of this, it is very important to tag the key changesets, so that it's easy to figure out which changesets, if any, have been fetched after the rebase operation.
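Tagging and cherry-picking can be combined into a small helper. This is a sketch under the assumption that a tag marks the last SVN changeset already carried over; the function, tag and branch names are hypothetical.

```shell
#!/bin/bash
# Sketch only: copy the SVN changesets made after the sync point onto the
# current branch, then move the sync tag forward.
pick_late_commits() {
    local sync_tag=$1     # tag on the last changeset already carried over
    local svn_branch=$2   # branch holding the freshly fetched SVN history

    # A range cherry-picks every commit after the tag, oldest first.
    git cherry-pick "$sync_tag..$svn_branch" &&
    git tag -f "$sync_tag" "$svn_branch"
}
```

Moving the tag only after a successful cherry-pick keeps it pointing at the last changeset known to be in the new branch.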
The final step of the process was to merge the two separate branches into a single development branch and start hacking away. The upside of this approach is that we now have the full history of both repositories, so we will be able to recreate any earlier build. We also managed to remove all the extra and unrelated branches from the repository in the process. This reduced the total number of changesets to under 50% of the original, which is really valuable since it reduces the noise in the logs.
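The final merge of two histories without a common ancestor can be sketched as below. The function name is hypothetical. The --allow-unrelated-histories flag is required by git 2.9 and later; older versions merged unrelated histories without it and do not recognize the flag.

```shell
#!/bin/bash
# Sketch only: merge a branch that shares no history with the current one.
merge_unrelated() {
    local branch=$1   # e.g. the branch holding the client history
    git merge --no-edit --allow-unrelated-histories "$branch"
}
```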
There were, of course, quite a few things that could have been done differently, and in a much better fashion. The following two are the most prominent. The biggest mistake was to create and publish the authoritative git repository too early. Had I not done that, most of the extra work with rebasing and moving files in git could have been avoided. I probably spent an extra working day or two on the task because of this. The optimal solution is to migrate to git in one step, without having to migrate changes between repositories.
The second mistake was to not convert the unix usernames in the authors of svn to proper name + email pairs. Git-svn has an option of doing this by providing a file with mappings between SVN usernames and git usernames. Since this was not done in the first place, the history will now contain entries with both unix username based authors and name+email authors.
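A skeleton for the mapping file can be generated from the SVN log. The sketch below assumes the usual output format of svn log --quiet; the function name and the example.com addresses are placeholders that have to be edited by hand afterwards.

```shell
#!/bin/bash
# Read `svn log --quiet` output on stdin and print one line per author in the
# form git-svn expects: svnuser = Full Name <email>
authors_skeleton() {
    awk -F ' [|] ' '/^r[0-9]+ / { print $2 " = " $2 " <" $2 "@example.com>" }' |
        sort -u
}
```

Typical use would be to pipe svn log --quiet for the repository URL into authors_skeleton, save the result as authors.txt, fill in the real names and addresses, and pass the file to git svn clone with --authors-file=authors.txt.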
The project has now been using the new VCS and the new project structure for about a month. The reactions have been overwhelmingly positive, although unlearning SVN ways and embracing the git way has been challenging from time to time. Git is quite a lot more complex than SVN (or Mercurial), but given that most developers on the team had no previous experience with git or DVCS in general, there have been very few mistakes. The support for git on Windows is still a nuisance, as all the tools seem to do things a bit differently. So far we've used EGit for Eclipse, TortoiseGit and Git Extensions. On OS X the situation is much better, with tools like Git Tower and of course old trusty bash with command line git. For Windows users, Mercurial does seem like a route of less resistance, though.
Git-svn is a great tool for a task like the one I have described, even with the problems in the process. A second go at the same task would probably be much easier after learning a number of things the hard way. Many of the problems I encountered were really things that you only know from previous experience, either your own or somebody else's. I would, however, be glad to hear and learn from you if you have done something similar and feel that you've got a better way of doing things.