axiac@web

Join two Git repositories and keep the original commit dates

Several years ago I started a project that was stored in a Subversion repository. After some time, the current (at that time) version of the code was used to create a new Git repository and the development continued. Several months and hundreds of commits later, I decided to gather the code from both repositories into a single repository and keep all the historical data intact.

The goal was to get all the code, from the start of the project to the most recent version, into a single Git repository, as if we had used Git from the very beginning of the project.

I’ll explain below how to accomplish this goal.

I started by using SubGit to import the Subversion commits into a new Git repository. I could have done the import with git svn, but SubGit does a better job.

The request

In a graphical fashion, the starting setup looks like this:

 o <- second-head, second-master    \
 |                                   |
 o                                   |
/|\                                  |
...                                  | 'second-repo', the newest code
\|/                                  |
 o                                   |
 |                                   |
 o <- second-root                   /
 .
 . <-- the desired link (it does not exist now)
 .
 o <- first-head, first-master      \
 |                                   |
 o                                   |
/|\                                  | 'first-repo', the oldest code
...                                  |
\|/                                  |
 o                                  /

We want to create a link between commits first-head and second-root (make first-head the parent of second-root) and:

  • preserve the code changes and author date and email of each commit imported from second-repo;
  • preserve the commit times of the commits imported from second-repo;
  • preserve the branches and merges from second-repo.

The first goal is automatically achieved by git rebase. It does not change the content of the commits it handles and it is very careful with the author information too.

In theory, all we need to do is:

git rebase first-head second-head

We’ll discover that, while it copies the commits from the second repository on top of first-head, it makes the history linear (flattens the branches and merges) and it sets the committer date of all the copied commits to the current date and time. This does not satisfy the second and third items from the list above.

Paying more attention to git help rebase, we’ll discover that adding the option --committer-date-is-author-date tells git to copy the author date as committer date for the commits it handles. While this still does not preserve the original commit dates, it is however pretty useful. Usually the committer date is the same as author date. They do not match for amended commits, rebased commits and commits submitted as patches through email. We could live with that but it still does not match the third item from our list. And it is an important one because the past branches and merges shape the history of the code base.
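To make the difference between the two dates concrete, here is a small sketch that runs in a throwaway repository. The branch names (topic, plain, copied), the file names and the timestamps are invented for the illustration; it only shows what a plain rebase and a rebase with --committer-date-is-author-date do to the dates.

```shell
# Throwaway repository; every name and timestamp below is illustrative.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name Demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'base'

# A topic commit authored in January 2015 and committed in March 2015.
git checkout -q -b topic
echo a > a.txt && git add a.txt
GIT_AUTHOR_DATE='1420106400 +0000' GIT_COMMITTER_DATE='1425211200 +0000' \
    git commit -q -m 'topic work'

# Move master forward so that the rebase has real work to do.
git checkout -q master
echo b > b.txt && git add b.txt
git commit -q -m 'master work'

# Plain rebase: the author date is kept, the committer date becomes "now".
git branch plain topic
git rebase -q master plain
plain_dates=$(git log -1 --pretty='%at %ct' plain)

# With the option: the committer date is copied from the author date.
git branch copied topic
git rebase -q --committer-date-is-author-date master copied
copied_dates=$(git log -1 --pretty='%at %ct' copied)

echo "original: 1420106400 1425211200"
echo "plain:    $plain_dates"
echo "copied:   $copied_dates"
```

In both cases the original committer timestamp (1425211200 here) is gone after the rebase, which is why the dates are backed up to a file later in this article.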

Looking more thoroughly into the help, we’ll discover the option --preserve-merges that helps git accomplish our third goal. The branches and merges are replicated correctly but, unfortunately, the commit dates are again set to the current date and time.
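The flattening is easy to reproduce in a throwaway repository. The sketch below uses made-up branch and file names. Note one deliberate substitution: it uses --rebase-merges, not the --preserve-merges option this article relies on, because --preserve-merges was removed in Git 2.34; --rebase-merges (available since Git 2.18) is its replacement, so treat this as an illustration for newer Git versions only.

```shell
# Throwaway repository; all branch and file names are illustrative.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name Demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'base'

# topic and side diverge after T1 and are merged again.
git checkout -q -b topic
echo t1 > t1.txt && git add t1.txt && git commit -q -m 'T1'
git checkout -q -b side
echo s1 > s1.txt && git add s1.txt && git commit -q -m 'S1'
git checkout -q topic
echo t2 > t2.txt && git add t2.txt && git commit -q -m 'T2'
git merge -q --no-ff -m 'merge side into topic' side

# Move master forward so that the rebase is not a no-op.
git checkout -q master
echo m1 > m1.txt && git add m1.txt && git commit -q -m 'M1'

merges_before=$(git rev-list --merges --count master..topic)

# Plain rebase makes the history linear: the merge commit disappears.
git branch flat topic
git rebase -q master flat
merges_flat=$(git rev-list --merges --count master..flat)

# --rebase-merges recreates the merge commit on top of the new base.
git branch kept topic
git rebase -q --rebase-merges master kept
merges_kept=$(git rev-list --merges --count master..kept)

echo "before=$merges_before flat=$merges_flat kept=$merges_kept"
```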

What’s wrong?

Nothing is wrong. The help explains: --preserve-merges internally uses the --interactive machinery and --committer-date-is-author-date is incompatible with --interactive.

Apparently this is a dead end.

Really?

I did some research on the Internet and I found a partial solution in an answer on StackOverflow. It is not completely baked, it even fails with a syntax error, but it helped me to find the right path and the complete solution.

My solution

The solution involves several steps:

  • prepare a new working repository; get all the required commits into it and mark the important ones with branches;
  • create the missing link between first-head and second-root; force its creation as Git will, most probably, complain;
  • rebase the other commits between second-root and second-head;
  • fix the committer date for all the commits affected by the previous two steps;
  • cleanup.

Preparations

Let’s start with the first repository (the older code) in ./first-repo and the second repository (the newer code) in ./second-repo.

Let’s create a new repository in ./merge-repo and do all the work there. We’ll clone the first repository, add the second one as a remote and fetch its commits.

mkdir ./merge-repo
cd ./merge-repo
git clone ../first-repo .
git remote add second-repo ../second-repo
git fetch second-repo

Next we’ll create some branches to point at some special commits: the first and the last commits from the second repository:

git branch second-head second-repo/master
git branch second-root $(git log second-head --reverse --pretty=%H | head -n 1)

The most recent commit of the first repository (this is where we will link second-root):

git branch first-head master

We’ll rename the master branch (it points to the most recent commit of the first repository) to first-master and mark the tip of the second repository with second-master. We will create another master branch after everything is completed.

git branch -m master first-master
git branch second-master second-repo/master

Finally, we remove all the remotes to keep the working repository isolated.

git remote remove origin
git remote remove second-repo

This way, if something goes wrong we can just remove the ./merge-repo directory and start over.
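The second-root branch created above relies on the first line of a reversed git log. As a cross-check, git rev-list --max-parents=0 lists the commits that have no parent. The sketch below, in a throwaway repository with invented commit messages, shows that both approaches agree for a history with a single root. If the second repository has more than one root commit, rev-list prints all of them and the reversed log shows only one, so it is worth checking before going further.

```shell
# Throwaway repository with a linear, three-commit history.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name Demo
git config user.email demo@example.com

git commit -q --allow-empty -m 'root'
expected_root=$(git rev-parse HEAD)
git commit -q --allow-empty -m 'second'
git commit -q --allow-empty -m 'third'

# The approach used in this article: oldest commit of the reversed log.
root_from_log=$(git log master --reverse --pretty=%H | head -n 1)

# Cross-check: commits without any parent (one line per root commit).
roots=$(git rev-list --max-parents=0 master)

echo "log:      $root_from_log"
echo "rev-list: $roots"
```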

Backup the commit dates

Save the tree hash and the commit time (Unix timestamp) of the commits from the second repository to a file. We’ll use these to restore the original commit times after the rebase. The tree hash is used to identify each commit. We could also save the commit hashes to the file but they are of no use because they change after the rebase. However, the tree hashes do not change because the rebase does not modify the content of the affected commits, only their parents and commit time.

git log --pretty='%T %ct' ..second-head > /tmp/hashlist
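The claim that the tree hash survives a change of parent and commit time can be checked by hand. This sketch builds a second commit object with git commit-tree in a throwaway repository (file name, messages and timestamp are made up) and compares the two. One caveat to keep in mind: commits that do not change any file (empty commits, for example) share a tree hash with their parent, so a lookup by tree hash cannot tell them apart.

```shell
# Throwaway repository; names and timestamps are illustrative.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git config user.name Demo
git config user.email demo@example.com

echo content > file.txt && git add file.txt
git commit -q -m 'original'
original=$(git rev-parse HEAD)
tree=$(git log -1 --pretty=%T)

# Same tree, but a different parent, message and committer date.
copy=$(GIT_COMMITTER_DATE='1425211200 +0000' \
    git commit-tree "$tree" -p "$original" -m 'same content, new parent')
tree_of_copy=$(git log -1 --pretty=%T "$copy")

echo "commit hashes: $original / $copy"
echo "tree hashes:   $tree / $tree_of_copy"
```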

Make first-head the parent of second-root

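Since we are happy with the files from both repositories and just want to paste second-root on top of first-head, any potential conflict must be resolved using the files from the applied commit (second-root). The cherry-pick is applied to the branch that is currently checked out, which after the preparations is still first-master, so we switch to first-head first:

git checkout first-head
git cherry-pick --strategy-option=theirs second-root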

This forces Git to apply second-root on top of first-head and to use the content from second-root to resolve any conflict that appears.
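Here is a minimal sketch of this step on two tiny, unrelated histories. The file names and contents are invented; the branch names mirror the ones used in this article. It assumes a Git version that can cherry-pick a root commit, which has been the case for a long time.

```shell
# Throwaway repository whose initial branch is called first-head.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/first-head
git config user.name Demo
git config user.email demo@example.com

# "Old" history: one commit with two files.
echo old > file.txt
echo keep > only-in-first.txt
git add . && git commit -q -m 'last commit of the first repository'
old_tip=$(git rev-parse HEAD)

# Unrelated "new" history: a root commit with a different file.txt.
git checkout -q --orphan second-root
git rm -q -rf .
echo new > file.txt
git add . && git commit -q -m 'first commit of the second repository'

# Apply the root commit on top of first-head, preferring its content.
git checkout -q first-head
git cherry-pick --strategy-option=theirs second-root

content=$(cat file.txt)
parent=$(git rev-parse HEAD^)
echo "file.txt now contains: $content"
```

The new commit has the old tip as its parent and file.txt carries the content from second-root. Files that exist only in the old history (only-in-first.txt here) are left in place, because the root commit says nothing about them.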

Copy the rest of the commits from the second repository

Try the rebase:

git rebase --preserve-merges --onto first-head --root second-head

It will stop with an error like this:

$ git rebase --preserve-merges --onto first-head --root second-head
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git reset'
rebase in progress; onto cffbb1c
You are currently rebasing branch 'second-head' on 'cffbb1c'.

nothing to commit, working directory clean
Could not pick 1f7f7036025ac1d48973818b1602fc9aa91731fb

Git complains that it cannot find any difference between second-root and first-head, and it is entirely right: the previous cherry-pick already applied second-root on top of the original first-head, so first-head now has the same content as second-root.

Let’s just tell Git to ignore this commit and continue:

git rebase --skip

This will take a while (depending on the size of your second repository) and it should complete successfully. If it fails, you are on your own, but it has no reason to fail.

Fix the committer dates

The rebase operation keeps most of the meta-data of the commits it changes. It changes the commit hash, of course, and it also changes the committer date (using the current date). We want to keep the original committer date (this is the entire point of this article after all).

We can “fix” the original committer dates using a bit of magic:

git filter-branch --env-filter 'export GIT_COMMITTER_DATE=$(fgrep -m 1 $(git log -1 --pretty=%T $GIT_COMMIT) /tmp/hashlist | cut -d" " -f2)' first-master..second-head

In plain English, git filter-branch lets you rewrite the Git revision history by applying custom filters to each revision. Our custom filter identifies the commit to be changed by its tree hash, looks up the corresponding commit date in the backup file we created earlier and uses the GIT_COMMITTER_DATE environment variable to set the desired committer date on the commit being processed.
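The lookup inside the filter can be tried on its own, without Git. The sketch below uses a temporary file with two fake tree hashes and timestamps (all values are invented) and the same grep and cut pipeline; grep -F is the modern spelling of fgrep.

```shell
# A stand-in for /tmp/hashlist: "<tree hash> <commit timestamp>" per line.
hashlist=$(mktemp)
cat > "$hashlist" <<'EOF'
1111111111111111111111111111111111111111 1425211200
2222222222222222222222222222222222222222 1420106400
EOF

# In the real filter this value comes from:
#   git log -1 --pretty=%T $GIT_COMMIT
tree=2222222222222222222222222222222222222222

# First matching line only (-m 1), then the second space-separated field.
restored=$(grep -F -m 1 "$tree" "$hashlist" | cut -d' ' -f2)

echo "GIT_COMMITTER_DATE would be set to: $restored"
rm "$hashlist"
```

If a tree hash is missing from the file, the pipeline prints nothing and the variable ends up empty, so it is worth making sure the backup file covers every commit in the range being rewritten.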

If something goes wrong

The previous position of the second-head branch can be found in the file .git/refs/original/refs/heads/second-head:

cat .git/refs/original/refs/heads/second-head

To revert the git filter-branch:

git reset --hard $(cat .git/refs/original/refs/heads/second-head)

Before trying to git filter-branch again, the backup ref file must be deleted (filter-branch refuses to run if it finds it):

rm .git/refs/original/refs/heads/second-head
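Reading and deleting the file directly works as long as the ref is stored as a loose file. Git may also keep refs in .git/packed-refs, in which case the file is not there. A sketch of the same operations through Git's own commands follows; the backup ref is created by hand here only to simulate what filter-branch leaves behind.

```shell
# Throwaway repository with two commits.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git config user.name Demo
git config user.email demo@example.com
git commit -q --allow-empty -m 'one'
old=$(git rev-parse HEAD)
git commit -q --allow-empty -m 'two'

# Simulated backup ref (filter-branch normally creates this itself).
git update-ref refs/original/refs/heads/second-head "$old"

# Read the backup ref without touching the .git directory.
backup=$(git rev-parse refs/original/refs/heads/second-head)
echo "backup points to: $backup"

# Delete it; this works for loose and packed refs alike.
git update-ref -d refs/original/refs/heads/second-head
remaining=$(git for-each-ref refs/original)
```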

Cleanup

After the successful linking, the current branch is second-head and we have some branches pointing to various commits involved in the process. We can rename second-head to master and remove the other branches.

git branch -m second-head master
git branch -D first-head
git branch -D second-root
git branch -D second-master
rm .git/refs/original/refs/heads/second-head

The branch first-master is still there, pointing to what used to be the master branch of the first repository. You will probably want to keep it as a reference (or, better, create a tag pointing to that commit).
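Replacing the branch with a tag could look like the sketch below. The tag name first-repo-final is only a suggestion, and the throwaway repository stands in for merge-repo.

```shell
# Throwaway repository; first-master marks the tip of the old history.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name Demo
git config user.email demo@example.com
git commit -q --allow-empty -m 'tip of the old history'
git branch first-master
tip=$(git rev-parse first-master)

# Keep a permanent marker on that commit, then drop the branch.
git tag first-repo-final first-master
git branch -q -D first-master

tagged=$(git rev-parse first-repo-final)
echo "first-repo-final points to: $tagged"
```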

Remove the hash file:

rm /tmp/hashlist

Remarks

  • Only the current branch from the new repository will be appended to the old repository; any dangling branch needs to be rebased individually after the process completes; the same technique could work, given the join points are set up correctly.

  • Extras from git help filter-branch:

Note that since this operation is very I/O expensive, it might be a good idea to redirect the temporary directory off-disk with the -d option, e.g. on tmpfs. Reportedly the speedup is very noticeable.

It took a couple of seconds for me, for about 2,500 commits, but this is not representative because my repository was stored on an SSD.

  • Because of the rebase, ALL the commits from the newer repository changed their hashes. If the repository is published this will puzzle the other contributors. Before attempting this stunt, make sure that all the important branches are merged, everybody knows what’s going on and how to catch up and continue afterwards without losing their work.

    You have been warned!
