Several years ago I started a project that was stored in a Subversion repository. After some time, the current (at that time) version of the code was used to create a new Git repository and the development continued. Several months and hundreds of commits later, I decided to gather the code from both repositories into a single repository and keep all the historical data intact.
The goal was to get all the code, from the start of the project up to the most recent version, into a single Git repository, as if we had used Git from the very beginning.
I’ll explain below how to accomplish this goal.
I started by using SubGit to import the Subversion commits into a new Git repository.
I could have done the import with git svn, but SubGit does a better job.
In a graphical fashion, the starting setup looks like this:
    o  <- second-head, second-master   \
    |                                   |
    o                                   |
   /|\                                  |
   ...                                  | 'second-repo', the newest code
   \|/                                  |
    o                                   |
    |                                   |
    o  <- second-root                  /
    .
    .  <-- the desired link (it does not exist now)
    .
    o  <- first-head, first-master     \
    |                                   |
    o                                   |
   /|\                                  | 'first-repo', the oldest code
   ...                                  |
   \|/                                  |
    o                                  /
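We want to create a link between the commits first-head and second-root (that is, to make first-head the parent of second-root) and, at the same time, to:
- preserve the code changes and the author date and email of each commit imported from the second repository;
- preserve the commit times of the commits imported from the second repository;
- preserve the branches and merges from the second repository.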
The first goal is achieved automatically by git rebase. It does not change the content of the commits
it handles and it is very careful with the author information too.
In theory, all we need to do is:
$ git rebase first-head second-head
We’ll discover that, while it copies the commits from the second repository on top of first-head, it also
makes the history linear (flattens the branches and merges) and sets the committer date of all the
copied commits to the current date and time. This does not satisfy the second and third items from our list.
Paying more attention to git help rebase, we’ll discover that adding the option --committer-date-is-author-date tells
git to use the author date as the committer date for the commits it handles. While this still
does not preserve the original commit dates, it is nevertheless pretty useful: usually the committer date is
the same as the author date. They do not match for amended commits, rebased commits and commits submitted
as patches through email. We could live with that, but it still does not satisfy the third item from our list.
And it is an important one because the past branches and merges shape the history of the code base.
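The difference between the two dates is easy to observe on a throwaway repository. The following is only a sketch: the temporary directory, the identity and the two timestamps are made up for the demonstration, and the dates are forced through the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables.

```shell
# Throwaway repository; every name and date below is made up for the demo.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir" || exit 1

# A fixed identity, so the demo does not depend on the global Git config.
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# The author date and the committer date can be set independently.
GIT_AUTHOR_DATE='1400000000 +0000' GIT_COMMITTER_DATE='1500000000 +0000' \
    git commit -q --allow-empty -m 'demo commit'

# %at is the author timestamp, %ct is the committer timestamp.
dates=$(git log -1 --pretty='%at %ct')
echo "$dates"
```

The same format string, with %ad and %cd instead of %at and %ct, prints the dates in a human-readable form.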
Looking more thoroughly into the help, we’ll discover the option
--preserve-merges that helps
accomplish our third goal. The branches and merges are replicated correctly but, unfortunately, the
commit dates are again set to the current date&time.
Nothing is wrong. The help explains that --preserve-merges internally uses the --interactive machinery, and that
--committer-date-is-author-date is incompatible with --interactive.
Apparently this is a dead end.
I did some research on the Internet and I found a partial solution in an answer on StackOverflow. It is not completely baked, it even fails with a syntax error, but it helped me to find the right path and the complete solution.
The solution involves several steps:
- prepare a new working repository; get all the required commits into it and mark the important ones with branches;
- create the missing link between first-head and second-root; force its creation as Git will, most probably, complain;
- rebase the other commits between second-root and second-head;
- fix the committer date for all the commits affected by the previous two steps.
Let’s start with the first repository (the older code) in ./first-repo and the second repository
(the newer code) in ./second-repo.
Let’s create a new repository in ./merge-repo and do all the work there. We’ll clone the first
repository, add the second one as a remote and fetch its commits.
$ mkdir ./merge-repo
$ cd ./merge-repo
$ git clone ../first-repo .
$ git remote add second-repo ../second-repo
$ git fetch second-repo
Next we’ll create some branches to point at some special commits: the first and the last commits from the second repository:
$ git branch second-head second-repo/master
$ git branch second-root $(git log second-head --reverse --pretty=%H | head -n 1)
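The root commit can also be asked for directly, without listing the whole history. The sketch below compares the two ways on a throwaway repository with three made-up commits; it assumes a history with a single root, since a history with several roots would make git rev-list print more than one line.

```shell
# Throwaway repository with a linear history of three empty commits.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir" || exit 1
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

git commit -q --allow-empty -m 'first'
first=$(git rev-parse HEAD)
git commit -q --allow-empty -m 'second'
git commit -q --allow-empty -m 'third'

# The approach used above: list the history oldest first, keep the first line.
root_by_log=$(git log HEAD --reverse --pretty=%H | head -n 1)

# Alternative: list only the commits that have no parents.
root_by_revlist=$(git rev-list --max-parents=0 HEAD)

echo "$root_by_log"
echo "$root_by_revlist"
```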
The most recent commit of the first repository (this is where we will link second-root):
$ git branch first-head master
We’ll rename the master branch (it points to the most recent commit of the first repository) to
first-master, and create second-master as a marker for the most recent commit of the second repository.
We will create another master branch after everything is completed.
$ git branch -m master first-master
$ git branch second-master second-repo/master
Finally, we remove all the remotes to keep the working repository isolated.
$ git remote remove origin
$ git remote remove second-repo
This way, if something goes wrong we can just remove the
./merge-repo directory and start over.
Back up the commit dates
Save the tree hash and the commit time (Unix timestamp) of the commits from the second repository to a file. We’ll use these to restore the original commit times after the rebase. The tree hash is used to identify each commit. We could also save the commit hashes to the file but they are of no use because they change after the rebase. However, the tree hashes do not change because the rebase does not modify the content of the affected commits, only their parents and commit time.
$ git log --pretty='%T %ct' ..second-head > /tmp/hashlist
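The claim that the tree hash survives a change of parents can be checked in isolation. In the sketch below (throwaway repository, made-up file and messages) git commit-tree is used as a stand-in for what a rebase does to a commit: it creates a new commit with the same content but different parents.

```shell
# Throwaway repository with two commits.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir" || exit 1
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

echo 'hello' > file.txt
git add file.txt
git commit -q -m 'base'
echo 'world' >> file.txt
git commit -q -a -m 'change'
orig=$(git rev-parse HEAD)

# Create a parentless commit that reuses the tree of the last commit.
copy=$(echo 'change' | git commit-tree "$(git rev-parse 'HEAD^{tree}')")

# Same content means same tree hash, although the commit hashes differ.
orig_tree=$(git log -1 --pretty=%T "$orig")
copy_tree=$(git log -1 --pretty=%T "$copy")
echo "commits: $orig $copy"
echo "trees:   $orig_tree $copy_tree"
```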
Make first-head the parent of second-root
Since we are happy with the files from both repositories and just want to paste
second-root on top of
first-head, any potential conflict must be resolved using the files from the applied commit (second-root):
$ git checkout first-head   # the cherry-pick must advance first-head, not first-master
$ git cherry-pick --strategy-option=theirs second-root
This tells git to apply
second-root on top of
first-head and to use the content from second-root
to resolve any conflict that appears.
Copy the rest of the commits from the second repository
Try the rebase:
$ git rebase --preserve-merges --onto first-head --root second-head
It will stop with an error like this:
$ git rebase --preserve-merges --onto first-head --root second-head
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git reset'
rebase in progress; onto cffbb1c
You are currently rebasing branch 'second-head' on 'cffbb1c'.

nothing to commit, working directory clean
Could not pick 1f7f7036025ac1d48973818b1602fc9aa91731fb
It basically complains that it cannot find any difference between first-head and second-root,
and it is entirely right; using the previous
cherry-pick we just applied the commit second-root
on top of the original
first-head and now
first-head looks identical to second-root.
Let’s just tell Git to ignore this commit and continue:
$ git rebase --skip
This will take a while (depending on the size of your second repository) and it should complete successfully. If it fails, you are on your own, but it has no reason to fail.
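A caveat for readers on a recent Git: --preserve-merges was deprecated and later removed (Git 2.34 and newer reject it), and --rebase-merges is the documented replacement. The sketch below is not the procedure used in this article; it only illustrates, on a throwaway repository with made-up branch names (old, new, side), that --rebase-merges keeps a merge commit while moving a history that has its own root on top of another branch.

```shell
# Throwaway repository; branch and file names are made up for the demo.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir" || exit 1
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# "old" plays the role of the first repository.
git symbolic-ref HEAD refs/heads/old
echo 'old' > old.txt
git add old.txt
git commit -q -m 'old work'

# "new" is an unrelated history with its own root, a side branch and a merge.
git checkout -q --orphan new
git rm -q -rf .
echo 'a' > a.txt
git add a.txt
git commit -q -m 'new root'
git checkout -q -b side
echo 'b' > b.txt
git add b.txt
git commit -q -m 'side work'
git checkout -q new
echo 'c' > c.txt
git add c.txt
git commit -q -m 'main work'
git merge -q --no-ff -m 'merge side' side

# Replay the whole "new" history, merge included, on top of "old".
git rebase -q --rebase-merges --onto old --root new

# Count the merge commits that exist between "old" and the rebased "new".
merge_count=$(git rev-list --count --merges old..new)
echo "merge commits after the rebase: $merge_count"
```

The committer dates are still reset by such a rebase, so the fix described in the next section applies either way.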
Fix the committer dates
The rebase operation keeps most of the metadata of the commits it changes. It changes the commit hash,
of course, and it also changes the committer date (using the current date). We want to keep the original
committer date (this is the entire point of this article, after all).
We can “fix” the original committer dates using a bit of magic:
$ git filter-branch --env-filter 'export GIT_COMMITTER_DATE=$(fgrep -m 1 $(git log -1 --pretty=%T $GIT_COMMIT) /tmp/hashlist | cut -d" " -f2)' first-master..second-head
In plain English,
git filter-branch lets you rewrite the Git revision history by applying a custom filter
to each revision. Our custom filter identifies the commit to be changed by its tree hash, finds the
corresponding commit date in the backup file we created earlier and uses the GIT_COMMITTER_DATE
environment variable to set the desired
committer date on the commit being processed.
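The lookup performed by the filter can be tried on its own, without touching a repository. The sketch below wraps it in a hypothetical helper, lookup_commit_time, and runs it against a made-up backup file with shortened, fake tree hashes. It uses grep -F, the equivalent of fgrep. The duplicated hash shows a limitation worth knowing about: when two commits share the same tree (a revert that restores an earlier state, for example), -m 1 makes both of them receive the date listed first in the file.

```shell
# Hypothetical helper: print the saved commit time for a tree hash.
#   $1 = file with "<tree hash> <timestamp>" lines
#   $2 = tree hash to look for
lookup_commit_time() {
    grep -F -m 1 -- "$2" "$1" | cut -d' ' -f2
}

# A made-up backup file; real tree hashes are 40 characters long.
hashlist=$(mktemp)
printf '%s\n' \
    'aaaa1111 1400000900' \
    'bbbb2222 1400000500' \
    'aaaa1111 1400000000' > "$hashlist"

found=$(lookup_commit_time "$hashlist" 'bbbb2222')
duplicate=$(lookup_commit_time "$hashlist" 'aaaa1111')
echo "bbbb2222 -> $found"
echo "aaaa1111 -> $duplicate (first match wins)"

rm -f "$hashlist"
```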
If something goes wrong
The previous position of the
second-head branch can be found in the file .git/refs/original/refs/heads/second-head:
$ cat .git/refs/original/refs/heads/second-head
To revert the second-head branch to that position:
$ git reset --hard $(cat .git/refs/original/refs/heads/second-head)
Before trying to run
git filter-branch again, the backup ref file must be deleted (git filter-branch refuses
to run if it finds it):
$ rm .git/refs/original/refs/heads/second-head
After the successful linking, the current branch is
second-head and we have some branches pointing
to various commits involved in the process. We can rename second-head to
master and remove the other branches.
$ git branch -m second-head master
$ git branch -D first-head
$ git branch -D second-root
$ git branch -D second-master
$ rm .git/refs/original/refs/heads/second-head
first-master is still there, pointing to the
master branch of the first repository. You will
probably want to keep it as a reference (or, better, create a tag pointing to that commit).
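Before the backup file is removed, a check along these lines could confirm that every commit carries one of the saved dates. It is a sketch on a throwaway repository (made-up file, dates and list file), not a transcript from the real migration: it saves the list, verifies it, then simulates a rewritten committer date to show what a failure looks like.

```shell
# Throwaway repository with two commits and fixed committer dates.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir" || exit 1
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

echo 'one' > f.txt
git add f.txt
GIT_COMMITTER_DATE='1400000000 +0000' git commit -q -m 'one'
echo 'two' > f.txt
GIT_COMMITTER_DATE='1400000500 +0000' git commit -q -a -m 'two'

# Save "<tree hash> <commit time>" pairs, as in the backup step.
list=$(mktemp)
git log --pretty='%T %ct' > "$list"

# Count the commits whose "<tree> <time>" pair is NOT in the saved list.
before=$(git log --pretty='%T %ct' | grep -c -v -x -F -f "$list")

# Simulate a rewrite: same content, different committer date.
GIT_COMMITTER_DATE='1500000000 +0000' git commit -q --amend --no-edit
after=$(git log --pretty='%T %ct' | grep -c -v -x -F -f "$list")

echo "mismatches before: $before, after: $after"
```

On the merged repository the same pipeline would be run over first-master..master against /tmp/hashlist, and a count of 0 is the expected result.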
Remove the hash file:
$ rm /tmp/hashlist
Only the current branch from the new repository will be appended to the old repository; any dangling branch needs to be rebased individually after the process completes; the same technique could work, given the join points are set up correctly.
A note from git help filter-branch:
Note that since this operation is very I/O expensive, it might be a good idea to redirect the temporary directory off-disk with the -d option, e.g. on tmpfs. Reportedly the speedup is very noticeable.
For me it took a couple of seconds for about 2,500 commits, but this is not representative because my repository was stored on an SSD.
Because of the rebase, ALL the commits from the newer repository changed their hashes. If the repository is published this will puzzle the other contributors. Before attempting this stunt, make sure that all the important branches are merged, everybody knows what’s going on and how to catch up and continue afterwards without losing their work.
You have been warned!