The guts of git

[Posted April 12, 2005 by corbet]

Now that BitKeeper is no more, how will the kernel development process function? In the short term, the answer is "painfully." The rest of the 2.6.12 process looks like the good old days: patches emailed to Linus, who will apply them (hopefully) and occasionally release a snapshot tree. That mode might work for the short term, since only bug fixes should be merged before 2.6.12 comes out, but nobody wants to try to run the process that way for any period of time. The kernel team needs much better patch and workflow support if it is going to sustain a reasonable development pace. So a replacement for BitKeeper will have to come from somewhere.

For a while, the leading contender appeared to be monotone, which supports the distributed development model used with the kernel. There are some issues with monotone, however, with performance being at the top of the list: monotone simply does not scale to a project as large as the kernel. So Linus has, in classic form, gone off to create something of his own. The first version of the tool called "git" was announced on April 7. Since then, the tool has progressed rapidly. It is, however, a little difficult to understand from the documentation which is available at this point. Here's an attempt to clarify things.

Git is not a source code management (SCM) system. It is, instead, a set of low-level utilities (Linus compares it to a special-purpose filesystem) which can be used to construct an SCM system. Much of the higher-level work is yet to be done, so the interface that most developers will work with remains unclear.

At the lower levels, Git implements two data structures: an object database, and a directory cache. The object database can contain three types of objects:

Blobs are simply chunks of binary data - they are the contents of files. One blob exists in the object database for every revision of every file that git knows about. There is no direct connection between a blob and the name (or location) of the file which contains that blob. If a file is renamed, its blob in the object database remains unchanged.
Trees are a collection of blobs, along with their file names and permissions. A tree object describes the state of a directory hierarchy at a particular given time.
Commits (or "changesets") mark points in the history of a tree; they contain a log message, a tree object, and pointers to one or more "parent" commits (the first commit will have no parent).

The object database relies heavily on SHA hashes to function. When an object is to be added to the database, it is hashed, and the resulting checksum (in its ASCII representation) is used as its name in the database (almost - the first two bytes of the checksum are used to spread the files across a set of directories for efficiency). Some developers have expressed concerns about hash collisions, but that possibility does not seem to worry the majority. The object itself is compressed before being checksummed and stored.

It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap.
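The storage scheme described above can be sketched in a few lines of Python. This is only an illustration, not git's actual code or on-disk format: the function names, the use of zlib for compression, and the choice of SHA-1 are assumptions made for the example. It follows the order given in the text (compress, then checksum) and uses the first two characters of the ASCII checksum as a subdirectory name.

```python
import hashlib
import os
import tempfile
import zlib


def store_object(db_root: str, content: bytes) -> str:
    """Compress, checksum, and store content; return its ASCII name."""
    compressed = zlib.compress(content)
    # Following the description above, the compressed form is what gets hashed.
    name = hashlib.sha1(compressed).hexdigest()
    # The first two characters select a subdirectory; the rest is the file name.
    subdir = os.path.join(db_root, name[:2])
    os.makedirs(subdir, exist_ok=True)
    path = os.path.join(subdir, name[2:])
    if not os.path.exists(path):  # identical contents are stored only once
        with open(path, "wb") as f:
            f.write(compressed)
    return name


def load_object(db_root: str, name: str) -> bytes:
    """Read an object back by name and decompress it."""
    with open(os.path.join(db_root, name[:2], name[2:]), "rb") as f:
        return zlib.decompress(f.read())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as db:
        v1 = store_object(db, b"int main(void) { return 0; }\n")
        v2 = store_object(db, b"int main(void) { return 1; }\n")
        again = store_object(db, b"int main(void) { return 0; }\n")
        print(v1 != v2)     # two revisions are two unrelated objects
        print(v1 == again)  # the same contents always get the same name
```

Nothing in the store links `v1` to `v2`; as the text says, that relationship only exists through the commits and trees that refer to them.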

The directory cache is a single, binary file containing a tree object; it captures the state of the directory tree at a given time. The state as seen by the cache might not match the actual directory's contents; it could differ as a result of local changes, or of a "pull" of a repository from elsewhere.

If a developer wishes to create a repository from scratch, the first step is to run init-db in the top level of the source tree. People running PostgreSQL want to be sure not to omit the hyphen, or they may not get the results they were hoping for. init-db will create the directory cache file (.dircache/index); it will also, by default, create the object database in .dircache/objects. It is possible for the object database to be elsewhere, however, and possibly shared among users. The object database will initially be empty.

Source files can be added with the update-cache program. update-cache --add will add blobs to the object database for new files and create new blobs (leaving the old ones in place) for any files which have changed. This command will also update the directory cache with entries associating the current files' blobs with their current names, locations, and permissions.

What update-cache will not do is capture the state of the tree in any permanent way. That task is done by write-tree, which will generate a new tree object from the current directory cache and enter that object into the database. write-tree writes the SHA checksum associated with the new tree object to its standard output; the user is well-advised to capture that checksum, or the newly-created tree will be hard to access in the future.

The usual thing to do with a new tree object will be to bind it into a commit object; that is done with the commit-tree command. commit-tree takes a tree ID (the output from write-tree) and a set of parent commits, combines them with the changelog entry, and stores the whole thing as a commit object. That object, in essence, becomes the head of the current version of the source tree. Since each commit points to its parents, the entire commit history of the tree can be traversed by starting at the head. Just don't lose the SHA checksum for the last commit. Since each commit contains a tree object, the state of the source tree at commit time can be reconstructed at any point.
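To make the relationship between trees, commits, and the head concrete, here is a small in-memory model in Python. It is a sketch under loud assumptions: the JSON serialization, the dictionary standing in for the object database, and the function names `write_tree`, `commit_tree`, and `history` are all invented for illustration and only loosely mirror what the real tools do.

```python
import hashlib
import json

objects = {}  # in-memory stand-in for the object database


def put(kind: str, payload: dict) -> str:
    """Serialize an object, name it by its checksum, and store it."""
    data = json.dumps({"kind": kind, **payload}, sort_keys=True).encode()
    name = hashlib.sha1(data).hexdigest()
    objects[name] = data
    return name


def get(name: str) -> dict:
    return json.loads(objects[name])


def write_tree(cache: dict) -> str:
    """Rough analogue of write-tree: snapshot a cache of path -> blob name."""
    return put("tree", {"entries": cache})


def commit_tree(tree: str, parents: list, message: str) -> str:
    """Rough analogue of commit-tree: bind a tree to parents and a log entry."""
    return put("commit", {"tree": tree, "parents": parents, "message": message})


def history(head: str):
    """Yield every commit reachable from head, visiting each one once."""
    seen, stack = set(), [head]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        yield name
        stack.extend(get(name)["parents"])


if __name__ == "__main__":
    blob1 = put("blob", {"text": "version one"})
    blob2 = put("blob", {"text": "version two"})
    first = commit_tree(write_tree({"README": blob1}), [], "initial import")
    head = commit_tree(write_tree({"README": blob2}), [first], "update README")
    for name in history(head):
        print(name[:8], get(name)["message"])
```

Losing `head` in this model has the same consequence as in the text: the objects are still in the database, but nothing says where the history starts.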

The directory cache can be set to a given version of the tree by using read-tree; this operation reads a tree object from the object database and stores it in the directory cache, but does not actually change any files outside of the cache. From there, checkout-cache can be used to make the actual source tree look like the cached tree object. The show-diff tool prints the differences between the directory cache and what's actually in the directory tree currently. There is also a diff-tree tool which can generate the differences between any two trees.
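The idea behind comparing the cache against the working directory can be illustrated with a short Python sketch. It is not how show-diff is implemented: the real tool prints actual differences, while this example only lists the paths that differ, and it re-hashes every file, which a fast tool would presumably avoid. The names `snapshot` and `changed_paths` are hypothetical.

```python
import hashlib
import os
import tempfile


def blob_name(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def snapshot(root: str) -> dict:
    """Build a cache-like mapping of relative path -> content checksum."""
    cache = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            with open(full, "rb") as f:
                cache[os.path.relpath(full, root)] = blob_name(f.read())
    return cache


def changed_paths(cache: dict, root: str) -> list:
    """List paths whose current contents no longer match the cache."""
    current = snapshot(root)
    return sorted(p for p in cache.keys() | current.keys()
                  if cache.get(p) != current.get(p))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "main.c"), "w") as f:
            f.write("int main(void) { return 0; }\n")
        cache = snapshot(root)          # the cached state of the tree
        with open(os.path.join(root, "main.c"), "w") as f:
            f.write("int main(void) { return 1; }\n")
        print(changed_paths(cache, root))  # the local edit shows up here
```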

An early example of what can be done with these tools can be had by playing with the git-pasky distribution by Petr Baudis. Petr has layered a set of scripts over the git tools to create something resembling a source management system. The git-pasky distribution itself is available as a network repository; running "git pull" will update to the current version.

A "pull" operation, as implemented in git-pasky, performs these steps:

The current "head" commit for the local repository is found; git-pasky keeps the SHA checksum for the current commit in .dircache/HEAD.
The current head is obtained from the remote repository (using rsync) and compared with the local head. If the two are the same, no changes have been made and the job is done.
The remote object database is downloaded, again with rsync. This operation will add any new objects to the database.
Using diff-tree, a patch from the previous (local) version to the current (remote) version is generated. That patch is then applied to the current directory's contents. The patch technique is used to help preserve, if possible, any local changes to the files.
A call to read-tree updates the directory cache to match the current revision as obtained from the remote repository.
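The steps above can be modeled in Python with plain dictionaries standing in for repositories. This is a sketch, not git-pasky's script: the repository layout (`head`, `objects`, `cache`) is invented for the example, the rsync transfer is replaced by a dictionary copy, the patch step that preserves local changes is left out entirely, and moving the local head at the end is an assumption the list does not spell out.

```python
def pull(local: dict, remote: dict) -> bool:
    """Bring `local` up to date with `remote`; return True if anything changed.

    Each repository is a dict with three keys (all hypothetical):
      "head":    name of the current commit
      "objects": mapping of object name -> object
      "cache":   mapping of path -> blob name (the directory cache)
    """
    # Steps 1 and 2: find both heads and stop early if they match.
    if local["head"] == remote["head"]:
        return False

    # Step 3: fetch the remote object database, adding only new objects.
    for name, obj in remote["objects"].items():
        local["objects"].setdefault(name, obj)

    # Step 4 (diff-tree plus patch against the working files) is omitted here.

    # Step 5: read-tree analogue, loading the remote head's tree into the cache.
    commit = local["objects"][remote["head"]]
    tree = local["objects"][commit["tree"]]
    local["cache"] = dict(tree["entries"])

    # Assumption: the local head is then moved to the fetched commit.
    local["head"] = remote["head"]
    return True


if __name__ == "__main__":
    remote = {
        "head": "c2",
        "objects": {
            "c1": {"tree": "t1"}, "t1": {"entries": {"README": "b1"}},
            "c2": {"tree": "t2"}, "t2": {"entries": {"README": "b2"}},
        },
        "cache": {"README": "b2"},
    }
    local = {
        "head": "c1",
        "objects": {"c1": {"tree": "t1"}, "t1": {"entries": {"README": "b1"}}},
        "cache": {"README": "b1"},
    }
    print(pull(local, remote), local["head"], local["cache"])
```

Because objects are named by their checksums, the fetch in step 3 never has to overwrite anything: an object that already exists locally under a given name is the same object.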

Petr's version of git adds a number of other features as well. It is a far cry from a full-blown source code management system, since it lacks little details like release tagging, merging, graphical interfaces, etc. A basic structure is beginning to emerge, however.

When this work was begun, it was seen as a sort of insurance policy to be used until a real source management system could be found. There is a good chance, however, that git will evolve into something with staying power. It provides the needed low-level functionality in a reasonably simple way, and it is blindingly fast. Linus places a premium on speed:

If it takes half a minute to apply a patch and remember the changeset boundary etc (and quite frankly, that's _fast_ for most SCM's around for a project the size of Linux), then a series of 250 emails (which is not unheard of at all when I sync with Andrew, for example) takes two hours.

As if on cue, Andrew announced a set of 198 patches to be merged for 2.6.12:

This is the first live test of Linus's git-importing ability. I'm about to disappear for 1.5 weeks - hope we'll still have a kernel left when I get back.

If this test (and the ones that come after) goes well, and the resulting system evolves to where it meets Linus's needs, he may be unlikely to switch to yet another system in the future. So git is worth watching; it could develop into a powerful system in a hurry.

An amazing set of "eyes"

Posted Apr 12, 2005 16:42 UTC (Tue) by b7j0c (subscriber, #27559) [Link]

Apparently all bugs are shallow with enough eyes looking at them - well this tool has an exceptional pool of eyes looking at it - the kernel devs. I would not be surprised if this tool becomes good, quickly.

An amazing set of "eyes"

Posted Apr 12, 2005 17:08 UTC (Tue) by hppnq (guest, #14462) [Link]

The road Linus has more or less laid out seems to be exactly the right thing to do too. I don't think it is a coincidence that Linus kicked off this project himself: it almost ensures that many talented people will want to be involved (by coding or reviewing) -- and that there is a sound base on which those people can build great SCM software.

Open Source in action, it is still amazing.

Results of first test with 198 patches.

Posted Apr 12, 2005 17:05 UTC (Tue) by StevenCole (guest, #3068) [Link]

Linus seems happy with this first test. I just hope the disk usage doesn't become an issue down the road.

On Tue, 12 Apr 2005, Andrew Morton wrote:

>> 
>> This is the first live test of Linus's git-importing ability.  I'm about
>> to disappear for 1.5 weeks - hope we'll still have a kernel left when I
>> get back.

Yee-haa! 198 patches applied in less than 3 minutes. That's pretty exactly
the "one patch per second" I was aiming for (0.8 seconds per patch, so my
estimate from a few days ago of 0.75 was pretty much on the money).

	> du -sh .git 

	102M    .git

	> time dotest ~/andrews-first-patchbomb
	.. "Applying" messages scroll past ..

	real    2m39.840s
	user    1m40.594s
	sys     0m58.179s

	> time show-diff

	real    0m0.148s
	user    0m0.080s
	sys     0m0.068s

	> du -sh .git

	111M    .git

ie we added 9MB of stuff from a set of emails that totaled a 859kB mbox.

So say an expansion of about 10x over the pure emailed patches. Which is 
not out-of-line with my expectations, but considering that you _could_ 
have just compressed the patches and thrown the headers away and you'd 
have gotten a 190kB archive of just pure patches, it's not like this is 
hugely space-efficient.

I don't think I ever claimed it would be ;)

Anyway, I'm not going to release this tree, because quite frankly I want 
to double-check that everything went right, and I want to re-base the 
archive on some more history than starting _purely_ from scratch in 
2.6.12-rc2 (maybe from 2.6.11), but in general it looks good. Now, if I 
can get the stupid merging going on, it will actually be _useful_ ;)

			Linus

PS. Yes, the tree still builds after this exercise ;)

Results of first test with 198 patches.

Posted Apr 13, 2005 22:37 UTC (Wed) by mattdm (subscriber, #18) [Link]

> Linus seems happy with this first test. I just hope the disk usage doesn't become an issue down the road.

I think "down the road" is exactly the time when we won't have to worry so much about disk usage. :)

Results of first test with 198 patches.

Posted Apr 14, 2005 3:40 UTC (Thu) by StevenCole (guest, #3068) [Link]

> I think "down the road" is exactly the time when we won't have to worry so much about disk usage. :)

With the continued improvement in disk storage capacity, I'm sure you're right. A related concern is network bandwidth. Not everyone has or can have a high-speed link.

Here are Linus's and Andrew's takes on the subject:

On Tue, 12 Apr 2005, Andrew Morton wrote:
>
> Linus Torvalds (torvalds at osdl.org) wrote:
> >
> > ie we added 9MB of stuff from a set of emails that totaled a 859kB mbox.
> 
> The total size of the commits list since Nov 2002 is 500MB, excluding those
> "merge" thingies.
> 
> So I assume that the git tree will grow at 2GB/year.

Yes, that's within my mental envelope. I was estimating a 3-5GB git
archive for the last three years of BK work.

The good news is that the way git works, you really can put the old
history in "storage" - throw it away (and just rely on the distribution
meaning that it's _somewhere_ out there on the net) or write it on a DVD
and forgetting about it. Most people really only care about the last few 
months.

Is 2GB a year a lot? I think it's peanuts, but hey, I can fill up my whole 
disk with kernel trees, and I wouldn't feel it's wasted space. Others may 
have slightly different priorities ("hey, I could fit 5000 songs in 
there!")

                Linus

Results of first test with 198 patches.

Posted Apr 14, 2005 19:07 UTC (Thu) by iabervon (subscriber, #722) [Link]

Hopefully, the system won't touch more history than is absolutely necessary, and it can be taught to only fetch from the home of the big disk the history it needs and doesn't have. That way, people will only download things that they actually need. Even if someone ends up with the complete history, 2GB/year is only 544 bps; so long as nobody has to wait for 5 GB to download at once, it's fine.
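For anyone who wants to check that figure, here is the arithmetic as a few lines of Python. The only assumptions are a 365-day year and that "2GB" means 2 GiB (2 × 2^30 bytes); with decimal gigabytes the average comes out a little lower, at roughly 507 bps.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def average_bps(bytes_per_year: float) -> float:
    """Average bit rate needed to transfer bytes_per_year over one year."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR


binary = average_bps(2 * 2**30)   # 2 GiB per year
decimal = average_bps(2 * 10**9)  # 2 GB per year
print(f"{binary:.1f} bps (binary GB), {decimal:.1f} bps (decimal GB)")
```

Either way the sustained rate is far below even a dial-up link; the concern is the one-time download of accumulated history, not the steady state.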

Results of first test with 198 patches.

Posted Apr 21, 2005 15:42 UTC (Thu) by huaz (guest, #10168) [Link]

I fully disagree.

It might be the right thing to do now because it's simpler to do, not because disk space is cheap. That is ALWAYS an excuse.

It's OK if git is just a linux kernel specific tool. If someone wants to make a general SCM on it, I wouldn't even want to try if I know it doesn't even support "diff".

Previous git

Posted Apr 12, 2005 17:48 UTC (Tue) by bkw1a (subscriber, #4101) [Link]

But what about this:

$ man git
...
NAME
git - GNU Interactive Tools

SYNTAX
git [options] [path1] [path2]
gitps [options]
gitview [options] filename
...
DESCRIPTION
git is a file system browser with some shell facilities
which was designed to make your work much easier and much
efficient.

Previous git

Posted Apr 12, 2005 19:47 UTC (Tue) by khim (subscriber, #9252) [Link]

What, indeed ?

-rw-r--r--    1 0        0            8751 Jun 06  1996 git-4.3.10-4.3.11.diff.gz
-rw-r--r--    1 0        0           55745 Aug 30  1996 git-4.3.11-4.3.12.diff.gz
-rw-r--r--    1 0        0           76156 Nov 12  1996 git-4.3.12-4.3.13.diff.gz
-rw-r--r--    1 0        0           39461 Dec 05  1996 git-4.3.13-4.3.14.diff.gz
-rw-r--r--    1 0        0           35846 Dec 24  1996 git-4.3.14-4.3.15.diff.gz
-rw-r--r--    1 0        0           12872 Jan 28  1997 git-4.3.15-4.3.16.diff.gz
-rw-r--r--    1 0        0          336211 Jan 28  1997 git-4.3.16.tar.gz
-rw-r--r--    1 0        0          402888 Mar 14  1998 git-4.3.17.tar.gz
-rw-r--r--    1 0        0            5629 Jun 29  1999 git-4.3.18-4.3.19.diff.gz
-rw-r--r--    1 0        0          406138 Jun 01  1999 git-4.3.18.tar.gz
-rw-r--r--    1 0        0           21475 Mar 13  2000 git-4.3.19-4.3.20.diff.gz
-rw-r--r--    1 0        0          406914 Jun 29  1999 git-4.3.19.tar.gz
-rw-r--r--    1 0        0          426648 Mar 13  2000 git-4.3.20.tar.gz

I think we can safely presume it's abandoned. And we already have ACE and ACE, BALSA and BALSA, FUSE and FUSE, etc... So... What is the problem ?

Previous git

Posted Apr 13, 2005 14:50 UTC (Wed) by edomaur (subscriber, #14520) [Link]

and even Gentoo and Gentoo :)

monotone 0.18 with some speed improvements came on the 11th

Posted Apr 12, 2005 19:26 UTC (Tue) by ber (subscriber, #2142) [Link]

On April 11, monotone 0.18 was released with
some speed improvements. Linus is in the credits for the speed improvements, too.

http://www.venge.net/monotone/

Great name, but ...

Posted Apr 12, 2005 21:19 UTC (Tue) by wjhenney (guest, #11768) [Link]

> The current head is obtained from the remote repository (using rsync) and compared with the local head. If the two are the same, no changes have been made and the job is done.
>
> The remote object database is downloaded, again with rsync. This operation will add any new objects to the database.

... so what happens when Tridge pulls the free version of rsync? ;)

Great name, but ...

Posted Apr 12, 2005 22:51 UTC (Tue) by iabervon (subscriber, #722) [Link]

Linus writes a replacement, obviously. :)

Great name, but ...

Posted Apr 13, 2005 14:44 UTC (Wed) by dpash (guest, #1408) [Link]

rsync is GPL. If Tridge stops distributing it, you can still use older versions.

Great name, but ...

Posted Apr 13, 2005 17:13 UTC (Wed) by wjhenney (guest, #11768) [Link]

Hmm, I thought the winking emoticon was sufficient. Next time I'll try to
be more explicit. Unfortunately, <joke> ... </joke> is rejected by the
comment editor if you post as HTML.

Great name, but ...

Posted Apr 20, 2005 18:26 UTC (Wed) by vonbrand (guest, #4458) [Link]

Hummm... if <joke> ... </joke> doesn't work, maybe &lt;joke&gt; ... &lt;/joke&gt; does ;-)

Great name, but ...

Posted Apr 15, 2005 14:20 UTC (Fri) by tzafrir (subscriber, #11501) [Link]

s/r/z/ and you get the answer:

http://zsync.moria.org.uk/

The original idea and initial implementation were by the same Tridge, but he had to stop developing it.

The guts of git

Posted Apr 13, 2005 4:36 UTC (Wed) by bronson (subscriber, #4806) [Link]

I love Linus's attitude. "Disk space? Heck, it's not a problem now. If it becomes a problem, I suppose we'll fix it then." (my paraphrase) I've worked on a number of projects, failed of course, that needed this sort of pragmatism.

The guts of git

Posted Apr 13, 2005 9:27 UTC (Wed) by ekj (guest, #1524) [Link]

But in the context of source-code it's obviously true. Not necessarily so for all other contexts.

Source-code is *very* small and *very* compressible in relation to how much work it takes to produce it. If you invest a million dollars in developing some software over a year, *ALL* revisions of *ALL* files can still be stored, completely uncompressed, for a storage-cost in the pennies-range.

There ain't anyone seriously working on Kernel-development that has a problem storing 10 or 100GB in order to do so efficiently. And hard disks are less than a dollar/GB.

The guts of git

Posted Apr 14, 2005 21:02 UTC (Thu) by joey (guest, #328) [Link]

hmm, I've done some calculations before on checking out all revisions of all data I keep in my subversion repositories. IIRC, checking out all versions of all files in my ~3 gb of repositories would need closer to 1 terabyte of data than 100 gb. Not very practical for laptop use. :-)

The guts of git

Posted Apr 15, 2005 7:49 UTC (Fri) by njhurst (guest, #6022) [Link]

Have you considered for loops?

The guts of git

Posted Apr 15, 2005 21:45 UTC (Fri) by proski (subscriber, #104) [Link]

You cannot just multiply the number of revisions by the number of files unless you change all files in every revision. The files that don't change between revisions are not stored as separate copies (because their SHA1 checksum is the same). In fact, if you revert to original file contents, the repository would be reusing the old files.

git Mailing List

Posted Apr 13, 2005 7:21 UTC (Wed) by PaulDickson (guest, #478) [Link]

I totally missed the announcement about the git mailing list. I only saw
a single reference to it. Here's a link:

http://vger.kernel.org/vger-lists.html#git

Howto checkin git hook

Posted Aug 17, 2009 17:13 UTC (Mon) by dajoe13 (guest, #60297) [Link]

I have not been able to find out how to commit and push a hook to my git server archive for everyone's benefit. The githooks man page does not describe this and I have not turned up any fruitful google searches on the topic.

I am trying to add a post-checkout hook. I also noticed that the post-checkout sample does not exist when I init a new archive. I am running git version 1.6.0.2.

Regards,
Joe

Howto checkin git hook

Posted Aug 17, 2009 20:22 UTC (Mon) by nix (subscriber, #2304) [Link]

You can't push hooks, just as with everything else under .git/: it would be
dangerous, if you think about it, because hooks are executable code!

You have to use some other mechanism (ssh, whatever) to install it at the
server end.

Linus has guts

Posted Apr 14, 2005 9:58 UTC (Thu) by Tobu (subscriber, #24111) [Link]

What a sucker, this Linus guy.

I mean: for years, he's been ignoring the free SCM, on the grounds that they "didn't scale" for such a large project as the kernel. When he finally is bitten, he still scorns the alternatives, and comes up with a downgraded monotone sketch.
Being something of a rockstar, he'll get away with it, and will get enough developer support because people must still be able to talk to his tree; but in terms of development time, he's just throwing others' work out of the window.

Linus has guts

Posted Apr 14, 2005 11:34 UTC (Thu) by filipjoelsson (guest, #2622) [Link]

Hmm, if it takes Linus two hours to merge a patch-bomb, I think he has put less time into developing this new foundation for an SCM than he would have spent waiting for one of the existing SCMs to merge his patches over the next month or two.

Others are spending time inventing fancy algos, theories, and such, but fail on the real-world test. Why should Linus not scratch this desperate itch then? Why should he come to an existing project, and overturn it with his rather extreme demands? Does he have the time to wait for one of them to "grow up"? I think not. Remember that he is working full time, whereas many contributors to smaller projects are programming in their spare time. Thus, waiting on the spare-time programmers would be rather frustrating.

So, it's not a question about Linus being a rock star. It's a question of efficiency.

Linus has guts

Posted Apr 14, 2005 18:44 UTC (Thu) by dlang (guest, #313) [Link]

the other projects have had three years to develop a better SCM solution.

as Linus pointed out in the announcement that he was going to look at other options for SCM, they aren't usable yet for something the size of the kernel.

we'll see if git ends up being a dead-end system or the base for a better system in the future, but it has a very different focus than the other SCM options.

every one of the other options is starting with what the UI is, what commands the scm will implement. git is starting from the other direction and building the low-level pieces (and building them to be fast) and then layering the UI on top of that.

this seems like a much better approach to take, and the other SCM systems can take advantage of these primitives as well (assuming they are willing to live with a linux-only limitation or the slowdown from other systems that don't have some of the underlying infrastructure that git is counting on in the OS for its speed)

Linus has guts

Posted Apr 14, 2005 14:38 UTC (Thu) by kevinbsmith (guest, #4778) [Link]

At first, I thought Linus was wasting his time (and that of others) on git. It is still possible that he is, but if he really has found a super-simple alternative to the complexity of other distributed SCM systems, then he has really achieved something remarkable. Sometimes it takes an outsider to see what the experts overlook.

If (almost) anyone else had pitched this idea to him, I doubt Linus would have paid any attention. If (almost) anyone else had come up with git, they would not have immediately attracted a community of developers to help build it. Linus also has the advantage of targeting a single workflow process within a single project. That's far easier than creating a generic tool.

If someone can wrap a true, minimal but functional generic SCM around this thing, I will try it. I like minimalist simplicity. We should know within a few weeks whether the fundamental ideas are a good foundation, or have dead-end limitations. At the moment, I'm betting it will work.

One other thing: I wish the authors of the various free distributed SCM's could gather to critique git's future possibilities. Assuming they could set aside their own biases, they are in the best position to point out limitations and guide the design.

Linus has guts

Posted Mar 17, 2011 14:58 UTC (Thu) by nix (subscriber, #2304) [Link]

With six years' perspective, this comment is quite amusing. Most of the core of git is unchanged from this earliest sketch: one thing has been renamed (cache -> index), but the only major new concept is packfiles, which have zero impact on anything but disk space.

I think we can say that git is by now more than a 'downgraded monotone sketch' :)

The guts of git

Posted Apr 14, 2005 12:19 UTC (Thu) by lyda (guest, #7429) [Link]

i can't get git 0.04 to compile on a redhat 7.3 box.

update-cache.c: In function `index_fd':
update-cache.c:26: warning: implicit declaration of function `close'
update-cache.c: In function `fill_stat_cache_info':
update-cache.c:68: structure has no member named `st_ctim'
update-cache.c:70: structure has no member named `st_mtim'
update-cache.c: In function `match_data':
update-cache.c:116: warning: implicit declaration of function `read'
update-cache.c: In function `main':
update-cache.c:282: warning: implicit declaration of function `unlink'
make: *** [update-cache.o] Error 1

The guts of git

Posted Apr 14, 2005 19:49 UTC (Thu) by proski (subscriber, #104) [Link]

I guess you need to define _GNU_SOURCE for the st_* fields and include unistd.h for missing declarations. Userspace programming with all those compatibility issues is so much harder than the kernel, isn't it? :-)