Several years ago I started a project that was stored in a Subversion repository. After some time, the current (at that time) version of the code was used to create a new Git repository and the development continued. Several months and hundreds of commits later, I decided to gather the code from both repositories into a single repository and keep all the historical data intact.
The goal was to get all the code, from the start of the project up to the most recent version, into a single Git repository, as if we had used Git from the very beginning of the project.
I’ll explain below how to accomplish this goal.
I started by using SubGit to import the Subversion commits into a new Git repository.
I could do the import with git svn, but SubGit does a better job.
The request
In a graphical fashion, the starting setup looks like this:
     o   <- second-head, second-master   \
     |                                    |
     o                                    |
    /|\                                   |
    ...                                   | 'second-repo', the newest code
    \|/                                   |
     o                                    |
     |                                    |
     o   <- second-root                  /
     .
     .   <-- the desired link (it does not exist now)
     .
     o   <- first-head, first-master     \
     |                                    |
     o                                    |
    /|\                                   | 'first-repo', the oldest code
    ...                                   |
    \|/                                   |
     o                                   /
We want to create a link between commits first-head and second-root (make first-head the parent of second-root) and:
- preserve the code changes and author date and email of each commit imported from second-repo;
- preserve the commit times of the commits imported from second-repo;
- preserve the branches and merges from second-repo.
The first goal is automatically achieved by git rebase. It does not change the content of the commits it handles and it is very careful with the author information too.
In theory, all we need to do is:
git rebase first-head second-head
We’ll discover that, while it copies the commits from the second repository on top of first-head, it makes the history linear (flattens the branches and merges) and it sets the committer date of all the copied commits to the current date and time. This does not satisfy the second and third items from the list above.
Paying more attention to git help rebase, we’ll discover that adding the option --committer-date-is-author-date tells git to copy the author date as committer date for the commits it handles. While this still does not preserve the original commit dates, it is nevertheless pretty useful: usually the committer date is the same as the author date. They differ only for amended commits, rebased commits and commits submitted as patches through email. We could live with that, but it still does not address the third item from our list. And it is an important one because the past branches and merges shape the history of the code base.
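To make the difference between the two dates concrete, here is a minimal sketch in a throwaway repository. The temporary directory, the Demo identity and the two timestamps are made up for illustration; only the git commands and the %at / %ct format placeholders are standard.

```shell
# Throwaway repository; nothing here touches the real repositories.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"

# Identity for the demo commit (hypothetical values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hello > file.txt
git add file.txt

# Author date and committer date are separate fields of a commit;
# set them to different values on purpose.
GIT_AUTHOR_DATE='1400000000 +0000' \
GIT_COMMITTER_DATE='1500000000 +0000' \
git commit -q -m 'demo commit'

# %at = author timestamp, %ct = committer timestamp
dates=$(git log -1 --pretty='%at %ct')
echo "$dates"
```

The same git log format, run on a rebased branch, shows which commits had their committer date replaced.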
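Looking more thoroughly into the help, we’ll discover the option --preserve-merges that helps git accomplish our third goal (newer versions of Git offer --rebase-merges instead; --preserve-merges was deprecated in Git 2.22 and removed in Git 2.34). The branches and merges are replicated correctly but, unfortunately, the commit dates are again set to the current date and time.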
What’s wrong?
Nothing is wrong. The help explains: --preserve-merges internally uses the --interactive machinery and --committer-date-is-author-date is incompatible with --interactive.
Apparently this is a dead end.
Really?
I did some research on the Internet and found a partial solution in an answer on StackOverflow. It is not fully worked out and even fails with a syntax error, but it put me on the right path to the complete solution.
My solution
The solution involves several steps:
- prepare a new working repository; get all the required commits into it and mark the important ones with branches;
- create the missing link between first-head and second-root; force its creation as Git will, most probably, complain;
- rebase the other commits between second-root and second-head;
- fix the committer date for all the commits affected by the previous two steps;
- cleanup.
Preparations
Let’s start with the first repository (the older code) in ./first-repo and the second repository (the newer code) in ./second-repo.
Let’s create a new repository in ./merge-repo and do all the work there. We’ll clone the first repository, add the second one as a remote and fetch its commits.
mkdir ./merge-repo
cd ./merge-repo
git clone ../first-repo .
git remote add second-repo ../second-repo
git fetch second-repo
Next we’ll create some branches to point at some special commits: the first and the last commits from the second repository:
git branch second-head second-repo/master
git branch second-root $(git log second-head --reverse --pretty=%H | head -n 1)
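The head -n 1 approach assumes that the second repository has a single root commit and that it is the first entry printed by git log --reverse. As a hedged alternative, git rev-list --max-parents=0 lists the parentless commits directly. A minimal sketch in a throwaway repository (the directory, identity and commit messages are illustrative), showing that both approaches agree on a simple linear history:

```shell
# Throwaway repository with three commits on a single line of history.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"

export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

for i in 1 2 3; do
    echo "$i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done

# Approach used in the article: oldest entry of the reversed log.
root_by_log=$(git log HEAD --reverse --pretty=%H | head -n 1)

# Alternative: ask for commits that have no parents at all.
root_by_revlist=$(git rev-list --max-parents=0 HEAD)

echo "$root_by_log"
echo "$root_by_revlist"
```

If rev-list prints more than one hash, the history has several roots and the join point has to be chosen by hand.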
The most recent commit of the first repository (this is where we will link second-root):
git branch first-head master
We’ll rename the master branch (it points to the most recent commit of the first repository) to first-master and, in the same way, create second-master for the most recent commit of the second repository. We will create another master branch after everything is completed.
git branch -m master first-master
git branch second-master second-repo/master
Finally, we remove all the remotes to keep the working repository isolated.
git remote remove origin
git remote remove second-repo
This way, if something goes wrong we can just remove the ./merge-repo directory and start over.
Backup the commit dates
Save the tree hash and the commit time (Unix timestamp) of the commits from the second repository to a file. We’ll use these to restore the original commit times after the rebase. The tree hash is used to identify each commit. We could also save the commit hashes to the file but they are of no use because they change after the rebase. However, the tree hashes do not change because the rebase does not modify the content of the affected commits, only their parents and commit time.
git log --pretty='%T %ct' ..second-head > /tmp/hashlist
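One assumption is worth checking at this point: identifying commits by tree hash only works if each tree hash appears once in the list. Two commits with identical content (an empty commit, or a revert that restores an earlier state) share the same tree hash. A small sketch of such a check, run here against a made-up sample file rather than the real /tmp/hashlist:

```shell
# Sample list in the same "<tree-hash> <timestamp>" format.
# The hashes and timestamps are invented for the example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aaaa1111 1400000300
bbbb2222 1400000200
aaaa1111 1400000100
cccc3333 1400000000
EOF

# Print every tree hash that occurs more than once.
duplicates=$(cut -d' ' -f1 "$sample" | sort | uniq -d)
echo "$duplicates"
```

An empty result means every commit can be identified unambiguously; any hash printed here belongs to commits whose dates cannot be told apart by tree hash alone.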
Make first-head the parent of second-root
Since we are happy with the files from both repositories and just want to paste second-root on top of first-head, we make first-head the current branch and resolve any potential conflict using the files from the applied commit (second-root):
git checkout first-head
git cherry-pick --strategy-option=theirs second-root
This forces git to apply second-root on top of first-head and use the information from second-root to resolve any conflict that appears.
Copy the rest of the commits from the second repository
Try the rebase:
git rebase --preserve-merges --onto first-head --root second-head
It will stop with an error like this:
$ git rebase --preserve-merges --onto first-head --root second-head
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:
git commit --allow-empty
Otherwise, please use 'git reset'
rebase in progress; onto cffbb1c
You are currently rebasing branch 'second-head' on 'cffbb1c'.
nothing to commit, working directory clean
Could not pick 1f7f7036025ac1d48973818b1602fc9aa91731fb
It basically complains that it cannot find any difference between second-root and first-head, and it is entirely right: with the previous cherry-pick we just applied the commit second-root on top of the original first-head, so first-head now has the same content as second-root.
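A quick way to confirm this kind of diagnosis is to compare tree hashes: two commits with the same tree have identical content, so replaying one on top of the other leaves nothing to commit. A sketch in a throwaway repository, where the second commit is deliberately empty (directory, identity and messages are illustrative); in the working repository one would compare first-head^{tree} with second-root^{tree} instead:

```shell
# Throwaway repository with two commits that have the same content.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"

export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hello > file.txt
git add file.txt
git commit -q -m 'first'
git commit -q --allow-empty -m 'second, no content change'

# <commit>^{tree} resolves a commit to the tree object it points to.
tree_a=$(git rev-parse 'HEAD~1^{tree}')
tree_b=$(git rev-parse 'HEAD^{tree}')

if [ "$tree_a" = "$tree_b" ]; then
    same_tree=yes
else
    same_tree=no
fi
echo "same tree: $same_tree"
```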
Let’s just tell Git to ignore this commit and continue:
git rebase --skip
This will take a while (depending on the size of your second repository) and it should complete successfully. If it fails then you are on your own, but it has no reason to fail.
Fix the committer dates
The rebase operation keeps most of the meta-data of the commits it changes. It changes the commit hash, of course, and it also changes the committer date (using the current date). We want to keep the original committer date (this is the entire point of this article after all).
We can “fix” the original committer dates using a bit of magic:
git filter-branch --env-filter 'export GIT_COMMITTER_DATE=$(fgrep -m 1 $(git log -1 --pretty=%T $GIT_COMMIT) /tmp/hashlist | cut -d" " -f2)' first-master..second-head
In plain English, git filter-branch lets you rewrite the Git revision history by applying custom filters to each revision. Our custom filter identifies the commit to be changed by its tree hash, finds the corresponding commit date in the backup file we created earlier and uses the $GIT_COMMITTER_DATE environment variable to set the desired committer date on the commit being processed.
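The lookup performed by the filter can be tried in isolation. The sketch below uses a made-up sample file and a hypothetical helper function, lookup_commit_time, that mirrors the fgrep and cut pipeline (grep -F is the modern spelling of fgrep); it is an illustration of the idea, not part of the original procedure.

```shell
# Sample list in the "<tree-hash> <timestamp>" format of the backup file.
# The hashes and timestamps are invented for the example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aaaa1111 1400000300
bbbb2222 1400000200
cccc3333 1400000100
EOF

# Hypothetical helper: print the timestamp stored for a tree hash.
# $1 = tree hash, $2 = list file. Only the first match is used.
lookup_commit_time() {
    grep -F -m 1 "$1" "$2" | cut -d' ' -f2
}

restored=$(lookup_commit_time bbbb2222 "$sample")
echo "$restored"
```

When the tree hash is not in the list the helper prints nothing, which is why only commits that came from the second repository are expected to get a restored date.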
If something goes wrong
The previous position of the second-head branch can be found in the file .git/refs/original/refs/heads/second-head:
cat .git/refs/original/refs/heads/second-head
To revert the git filter-branch:
git reset --hard $(cat .git/refs/original/refs/heads/second-head)
Before trying git filter-branch again, the backup ref file must be deleted (filter-branch refuses to run if it finds it):
rm .git/refs/original/refs/heads/second-head
Cleanup
After the successful linking, the current branch is second-head and we have some branches pointing to various commits involved in the process. We can rename second-head to master and remove the other branches.
git branch -m second-head master
git branch -D first-head
git branch -D second-root
git branch -D second-master
rm .git/refs/original/refs/heads/second-head
The branch first-master is still there, pointing to the last commit of the master branch of the first repository. You will probably want to keep it as a reference (or, better, create a tag pointing at that commit).
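A tag keeps the reference without leaving an extra branch around. A minimal sketch in a throwaway repository; the tag name first-repo-final, the directory and the identity are made up for illustration:

```shell
# Throwaway repository with one commit and a first-master branch on it.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"

export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo hello > file.txt
git add file.txt
git commit -q -m 'last commit of the old repository'
git branch first-master

# Remember where the branch points, tag that commit, then drop the branch.
before=$(git rev-parse first-master)
git tag first-repo-final first-master
git branch -q -D first-master

# The tag still resolves to the same commit.
after=$(git rev-parse first-repo-final)
echo "$before"
echo "$after"
```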
Remove the hash file:
rm /tmp/hashlist
Remarks
Only the current branch from the new repository will be appended to the old repository; any dangling branch needs to be rebased individually after the process completes; the same technique could work, given the join points are set up correctly.
Extras from git help filter-branch:
Note that since this operation is very I/O expensive, it might be a good idea to redirect the temporary directory off-disk with the -d option, e.g. on tmpfs. Reportedly the speedup is very noticeable.
It took a couple of seconds for me, for about 2,500 commits, but that is hardly representative because my repository was stored on an SSD.
Because of the rebase, ALL the commits from the newer repository changed their hashes. If the repository is published this will puzzle the other contributors. Before attempting this stunt, make sure that all the important branches are merged, everybody knows what’s going on and how to catch up and continue afterwards without losing their work.
You have been warned!