The guts of git
For a while, the leading contender appeared to be monotone, which supports the distributed development model used with the kernel. There are some issues with monotone, however, with performance being at the top of the list: monotone simply does not scale to a project as large as the kernel. So Linus has, in classic form, gone off to create something of his own. The first version of the tool called "git" was announced on April 7. Since then, the tool has progressed rapidly. It is, however, a little difficult to understand from the documentation which is available at this point. Here's an attempt to clarify things.
Git is not a source code management (SCM) system. It is, instead, a set of low-level utilities (Linus compares it to a special-purpose filesystem) which can be used to construct an SCM system. Much of the higher-level work is yet to be done, so the interface that most developers will work with remains unclear.
At the lower levels, Git implements two data structures: an object database, and a directory cache. The object database can contain three types of objects:
- Blobs are simply chunks of binary data - they are the contents of files. One blob exists in the object database for every revision of every file that git knows about. There is no direct connection between a blob and the name (or location) of the file which contains that blob. If a file is renamed, its blob in the object database remains unchanged.
- Trees are collections of blobs, along with their file names and permissions. A tree object describes the state of a directory hierarchy at a given time.
- Commits (or "changesets") mark points in the history of a tree; they contain a log message, a tree object, and pointers to one or more "parent" commits (the first commit will have no parent).
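The relationships among the three object types can be sketched in a few lines of Python. This is a toy model with hypothetical formats (real git is written in C and stores trees as raw byte streams); only the structure - blobs named by content, trees naming blobs, commits naming trees - follows the article:

```python
import hashlib

def obj_id(data: bytes) -> str:
    """Name an object by the hex SHA of its contents, as git does."""
    return hashlib.sha1(data).hexdigest()

# A blob is just file contents; its name carries no file name or path.
blob = b"int main(void) { return 0; }\n"
blob_id = obj_id(blob)

# A tree maps file names (and permissions) to blob IDs, describing one
# snapshot of a directory. (The textual format here is invented.)
tree = f"100644 main.c {blob_id}".encode()
tree_id = obj_id(tree)

# A commit points at a tree, at zero or more parent commits, and
# carries the log message. The first commit has no parent.
commit = f"tree {tree_id}\n\ninitial import\n".encode()
commit_id = obj_id(commit)

# Renaming main.c to hello.c produces a new tree (and a new commit),
# but the blob -- and therefore its ID -- is untouched.
tree2 = f"100644 hello.c {blob_id}".encode()
assert obj_id(blob) == blob_id
```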
The object database relies heavily on SHA hashes to function. When an object is to be added to the database, it is hashed, and the resulting checksum (in its ASCII representation) is used as its name in the database (almost - the first two bytes of the checksum are used to spread the files across a set of directories for efficiency). Some developers have expressed concerns about hash collisions, but that possibility does not seem to worry the majority. The object itself is compressed before being checksummed and stored.
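The naming scheme can be sketched as follows. This follows the article's description of the earliest git - the object is compressed first and the checksum is taken over the compressed bytes (later git versions hash the uncompressed data instead) - and the `.dircache/objects` layout with its two-character fan-out:

```python
import hashlib
import zlib

def object_path(data: bytes) -> tuple[str, bytes]:
    """Return the database path and stored bytes for an object, per the
    early-git scheme described in the article: compress first, then
    name the result by the ASCII form of its SHA checksum."""
    compressed = zlib.compress(data)
    name = hashlib.sha1(compressed).hexdigest()
    # The leading two characters select a subdirectory, spreading
    # objects across 256 directories for filesystem efficiency.
    return f".dircache/objects/{name[:2]}/{name[2:]}", compressed

path, stored = object_path(b"hello, world\n")
# Identical contents always map to the same path, so storing the
# same revision twice adds nothing to the database.
assert path == object_path(b"hello, world\n")[0]
```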
It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap.
The directory cache is a single, binary file containing a tree object; it captures the state of the directory tree at a given time. The state as seen by the cache might not match the actual directory's contents; it could differ as a result of local changes, or of a "pull" of a repository from elsewhere.
If a developer wishes to create a repository from scratch, the first step is to run init-db in the top level of the source tree. People running PostgreSQL want to be sure not to omit the hyphen, or they may not get the results they were hoping for. init-db will create the directory cache file (.dircache/index); it will also, by default, create the object database in .dircache/objects. It is possible for the object database to be elsewhere, however, and possibly shared among users. The object database will initially be empty.
Source files can be added with the update-cache program. update-cache --add will add blobs to the object database for new files and create new blobs (leaving the old ones in place) for any files which have changed. This command will also update the directory cache with entries associating the current files' blobs with their current names, locations, and permissions.
What update-cache will not do is capture the state of the tree in any permanent way. That task is done by write-tree, which will generate a new tree object from the current directory cache and enter that object into the database. write-tree writes the SHA checksum associated with the new tree object to its standard output; the user is well-advised to capture that checksum, or the newly-created tree will be hard to access in the future.
The usual thing to do with a new tree object will be to bind it into a commit object; that is done with the commit-tree command. commit-tree takes a tree ID (the output from write-tree) and a set of parent commits, combines them with the changelog entry, and stores the whole thing as a commit object. That object, in essence, becomes the head of the current version of the source tree. Since each commit points to its parents, the entire commit history of the tree can be traversed by starting at the head. Just don't lose the SHA checksum for the last commit. Since each commit contains a tree object, the state of the source tree at commit time can be reconstructed at any point.
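The traversal described above - start at the head and follow parent pointers back to the root - can be sketched with a toy in-memory commit graph (hypothetical structure; real commits live in the object database and are looked up by checksum):

```python
# Each commit records the tree it captured and its parent commit IDs.
commits = {
    "c1": {"tree": "t1", "parents": []},      # first commit: no parent
    "c2": {"tree": "t2", "parents": ["c1"]},
    "c3": {"tree": "t3", "parents": ["c2"]},  # current head
}

def history(head: str) -> list[str]:
    """Walk from the head back through all ancestors, as git does
    when reconstructing the history of a tree."""
    order, stack, seen = [], [head], set()
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        order.append(cid)
        stack.extend(commits[cid]["parents"])
    return order

assert history("c3") == ["c3", "c2", "c1"]
```

Lose the checksum of the head commit, and this walk has no starting point - which is why the head is worth keeping safe.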
The directory cache can be set to a given version of the tree by using read-tree; this operation reads a tree object from the object database and stores it in the directory cache, but does not actually change any files outside of the cache. From there, checkout-cache can be used to make the actual source tree look like the cached tree object. The show-diff tool prints the differences between the directory cache and the current contents of the directory tree. There is also a diff-tree tool which can generate the differences between any two trees.
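A tree-to-tree comparison of the kind diff-tree performs amounts to comparing two name-to-checksum mappings. A simplified sketch (real trees also carry permissions and nest subdirectories):

```python
def diff_trees(old: dict, new: dict) -> dict:
    """Report added, removed, and changed paths between two tree
    snapshots, each given as a mapping of file name to checksum."""
    return {
        "added":   sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # A path "changes" when it exists in both trees but points
        # at a different blob checksum.
        "changed": sorted(n for n in old.keys() & new.keys()
                          if old[n] != new[n]),
    }

old_tree = {"Makefile": "a1", "main.c": "b2"}
new_tree = {"Makefile": "a1", "main.c": "c3", "util.c": "d4"}
assert diff_trees(old_tree, new_tree) == {
    "added": ["util.c"], "removed": [], "changed": ["main.c"],
}
```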
An early example of what can be done with these tools can be had by playing with the git-pasky distribution by Petr Baudis. Petr has layered a set of scripts over the git tools to create something resembling a source management system. The git-pasky distribution itself is available as a network repository; running "git pull" will update to the current version.
A "pull" operation, as implemented in git-pasky, performs these steps:
- The current "head" commit for the local repository is found; git-pasky keeps the SHA checksum for the current commit in .dircache/HEAD.
- The current head is obtained from the remote repository (using rsync) and compared with the local head. If the two are the same, no changes have been made and the job is done.
- The remote object database is downloaded, again with rsync. This operation will add any new objects to the database.
- Using diff-tree, a patch from the previous (local) version to the current (remote) version is generated. That patch is then applied to the current directory's contents. The patch technique is used to help preserve, if possible, any local changes to the files.
- A call to read-tree updates the directory cache to match the current revision as obtained from the remote repository.
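The steps above can be condensed into a sketch. This is a toy model - git-pasky actually shells out to rsync, diff-tree, and patch - but the control flow is the same:

```python
def pull(local: dict, remote: dict) -> bool:
    """Toy model of a git-pasky pull: compare heads, fetch any
    missing objects, then move the local head forward. Returns
    True if anything changed."""
    if local["head"] == remote["head"]:
        return False                  # heads match: nothing to do
    # Downloading the remote object database only ever adds objects;
    # existing ones are already present under the same checksums.
    for name, obj in remote["objects"].items():
        local["objects"].setdefault(name, obj)
    # git-pasky applies a diff-tree patch here so that local edits
    # survive where possible; this sketch just advances the head.
    local["head"] = remote["head"]
    return True

local = {"head": "c1", "objects": {"c1": "..."}}
remote = {"head": "c2", "objects": {"c1": "...", "c2": "..."}}
assert pull(local, remote) is True
assert local["head"] == "c2"
assert pull(local, remote) is False   # a second pull is a no-op
```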
Petr's version of git adds a number of other features as well. It is a far cry from a full-blown source code management system, since it lacks little details like release tagging, merging, graphical interfaces, etc. A basic structure is beginning to emerge, however.
When this work was begun, it was seen as a sort of insurance policy to be used until a real source management system could be found. There is a good chance, however, that git will evolve into something with staying power. It provides the needed low-level functionality in a reasonably simple way, and it is blindingly fast; Linus places a premium on speed.
As if on cue, Andrew announced a set of 198 patches to be merged for 2.6.12.
If this test (and the ones that come after) goes well, and the resulting
system evolves to where it meets Linus's needs, he may be unlikely to
switch to yet another system in the future. So git is worth watching; it
could develop into a powerful system in a hurry.
Index entries for this article:
Kernel: Development tools/Git
Kernel: Git
An amazing set of "eyes"
Posted Apr 12, 2005 16:42 UTC (Tue) by b7j0c (subscriber, #27559) [Link]
Apparently all bugs are shallow with enough eyes looking at them - well this tool has an exceptional pool of eyes looking at it - the kernel devs. I would not be surprised if this tool becomes good, quickly.
An amazing set of "eyes"
Posted Apr 12, 2005 17:08 UTC (Tue) by hppnq (guest, #14462) [Link]
The road Linus has more or less laid out seems to be exactly the right thing to do too. I don't think it is a coincidence that Linus kicked off this project himself: it almost ensures that many talented people will want to be involved (by coding or reviewing) -- and that there is a sound base on which those people can build great SCM software. Open Source in action, it is still amazing.
Results of first test with 198 patches.
Posted Apr 12, 2005 17:05 UTC (Tue) by StevenCole (guest, #3068) [Link]
Linus seems happy with this first test. I just hope the disk usage doesn't become an issue down the road.
On Tue, 12 Apr 2005, Andrew Morton wrote:
>
> This is the first live test of Linus's git-importing ability. I'm about
> to disappear for 1.5 weeks - hope we'll still have a kernel left when I
> get back.

Yee-haa! 198 patches applied in less than 3 minutes. That's pretty exactly the "one patch per second" I was aiming for (0.8 seconds per patch, so my estimate from a few days ago of 0.75 was pretty much on the money).

> du -sh .git
102M	.git
> time dotest ~/andrews-first-patchbomb
.. "Applying" messages scroll past ..
real	2m39.840s
user	1m40.594s
sys	0m58.179s
> time show-diff
real	0m0.148s
user	0m0.080s
sys	0m0.068s
> du -sh .git
111M	.git

ie we added 9MB of stuff from a set of emails that totaled a 859kB mbox. So say an expansion of about 10x over the pure emailed patches. Which is not out-of-line with my expectations, but considering that you _could_ have just compressed the patches and thrown the headers away and you'd have gotten a 190kB archive of just pure patches, it's not like this is hugely space-efficient. I don't think I ever claimed it would be ;)

Anyway, I'm not going to release this tree, because quite frankly I want to double-check that everything went right, and I want to re-base the archive on some more history than starting _purely_ from scratch in 2.6.12-rc2 (maybe from 2.6.11), but in general it looks good. Now, if I can get the stupid merging going on, it will actually be _useful_ ;)

Linus

PS. Yes, the tree still builds after this exercise ;)
Results of first test with 198 patches.
Posted Apr 13, 2005 22:37 UTC (Wed) by mattdm (subscriber, #18) [Link]
Linus seems happy with this first test. I just hope the disk usage doesn't become an issue down the road.
I think "down the road" is exactly the time when we won't have to worry so much about disk usage. :)
Results of first test with 198 patches.
Posted Apr 14, 2005 3:40 UTC (Thu) by StevenCole (guest, #3068) [Link]
I think "down the road" is exactly the time when we won't have to worry so much about disk usage. :)

With the continued improvement in disk storage capacity, I'm sure you're right. A related concern is network bandwidth. Not everyone has or can have a high-speed link.
Here is Linus and Andrew's take on the subject:
On Tue, 12 Apr 2005, Andrew Morton wrote:
>
> Linus Torvalds (torvalds at osdl.org) wrote:
> >
> > ie we added 9MB of stuff from a set of emails that totaled a 859kB mbox.
>
> The total size of the commits list since Nov 2002 is 500MB, excluding those
> "merge" thingies.
>
> So I assume that the git tree will grow at 2GB/year.

Yes, that's within my mental envelope. I was estimating a 3-5GB git archive for the last three years of BK work.

The good news is that the way git works, you really can put the old history in "storage" - throw it away (and just rely on the distribution meaning that it's _somewhere_ out there on the net) or write it on a DVD and forgetting about it. Most people really only care about the last few months.

Is 2GB a year a lot? I think it's peanuts, but hey, I can fill up my whole disk with kernel trees, and I wouldn't feel it's wasted space. Others may have slightly different priorities ("hey, I could fit 5000 songs in there!")

Linus
Results of first test with 198 patches.
Posted Apr 14, 2005 19:07 UTC (Thu) by iabervon (subscriber, #722) [Link]
Hopefully, the system won't touch more history than is absolutely necessary, and it can be taught to only fetch from the home of the big disk the history it needs and doesn't have. That way, people will only download things that they actually need. Even if someone ends up with the complete history, 2GB/year is only 544 bps; so long as nobody has to wait for 5 GB to download at once, it's fine.
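The arithmetic in that last figure checks out, taking 2GB as 2 x 2^30 bytes:

```python
# 2 GB of history per year, expressed as a continuous bit rate.
bits_per_year = 2 * 2**30 * 8        # bytes -> bits
seconds_per_year = 365.25 * 24 * 3600
rate_bps = bits_per_year / seconds_per_year
assert 540 < rate_bps < 550          # roughly 544 bits per second
```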
Results of first test with 198 patches.
Posted Apr 21, 2005 15:42 UTC (Thu) by huaz (guest, #10168) [Link]
I fully disagree.
It might be the right thing to do now because it's simpler to do, not because disk space is cheap. That is ALWAYS an excuse.
It's OK if git is just a linux kernel specific tool. If someone wants to make a general SCM on it, I wouldn't even want to try if I know it doesn't even support "diff".
Previous git
Posted Apr 12, 2005 17:48 UTC (Tue) by bkw1a (subscriber, #4101) [Link]
But what about this:
$ man git
...
NAME
git - GNU Interactive Tools
SYNTAX
git [options] [path1] [path2]
gitps [options]
gitview [options] filename
...
DESCRIPTION
git is a file system browser with some shell facilities
which was designed to make your work much easier and much
efficient.
Previous git
Posted Apr 12, 2005 19:47 UTC (Tue) by khim (subscriber, #9252) [Link]
What, indeed ?
-rw-r--r-- 1 0 0   8751 Jun 06 1996 git-4.3.10-4.3.11.diff.gz
-rw-r--r-- 1 0 0  55745 Aug 30 1996 git-4.3.11-4.3.12.diff.gz
-rw-r--r-- 1 0 0  76156 Nov 12 1996 git-4.3.12-4.3.13.diff.gz
-rw-r--r-- 1 0 0  39461 Dec 05 1996 git-4.3.13-4.3.14.diff.gz
-rw-r--r-- 1 0 0  35846 Dec 24 1996 git-4.3.14-4.3.15.diff.gz
-rw-r--r-- 1 0 0  12872 Jan 28 1997 git-4.3.15-4.3.16.diff.gz
-rw-r--r-- 1 0 0 336211 Jan 28 1997 git-4.3.16.tar.gz
-rw-r--r-- 1 0 0 402888 Mar 14 1998 git-4.3.17.tar.gz
-rw-r--r-- 1 0 0   5629 Jun 29 1999 git-4.3.18-4.3.19.diff.gz
-rw-r--r-- 1 0 0 406138 Jun 01 1999 git-4.3.18.tar.gz
-rw-r--r-- 1 0 0  21475 Mar 13 2000 git-4.3.19-4.3.20.diff.gz
-rw-r--r-- 1 0 0 406914 Jun 29 1999 git-4.3.19.tar.gz
-rw-r--r-- 1 0 0 426648 Mar 13 2000 git-4.3.20.tar.gz
I think we can safely presume it's abandoned. And we already have ACE and ACE, BALSA and BALSA, FUSE and FUSE, etc... So... What is the problem ?
Previous git
Posted Apr 13, 2005 14:50 UTC (Wed) by edomaur (subscriber, #14520) [Link]
and even Gentoo and Gentoo :)
monotone 0.18 with some speed improvements came on the 11th
Posted Apr 12, 2005 19:26 UTC (Tue) by ber (subscriber, #2142) [Link]
On April 11, monotone 0.18 was released with some speed improvements. Linus is in the credits for the speed improvements, too.
Great name, but ...
Posted Apr 12, 2005 21:19 UTC (Tue) by wjhenney (guest, #11768) [Link]
... so what happens when Tridge pulls the free version of rsync? ;)
- The current head is obtained from the remote repository (using rsync) and compared with the local head. If the two are the same, no changes have been made and the job is done.
- The remote object database is downloaded, again with rsync. This operation will add any new objects to the database.
Great name, but ...
Posted Apr 12, 2005 22:51 UTC (Tue) by iabervon (subscriber, #722) [Link]
Linus writes a replacement, obviously. :)
Great name, but ...
Posted Apr 13, 2005 14:44 UTC (Wed) by dpash (guest, #1408) [Link]
rsync is GPL. If Tridge stops distributing it, you can still use older versions.
Great name, but ...
Posted Apr 13, 2005 17:13 UTC (Wed) by wjhenney (guest, #11768) [Link]
Hmm, I thought the winking emoticon was sufficient. Next time I'll try to be more explicit. Unfortunately, <joke> ... </joke> is rejected by the comment editor if you post as HTML.
Great name, but ...
Posted Apr 20, 2005 18:26 UTC (Wed) by vonbrand (guest, #4458) [Link]
Hummm... if <joke> ... </joke> doesn't work, maybe <joke> ... </joke> does ;-)
Great name, but ...
Posted Apr 15, 2005 14:20 UTC (Fri) by tzafrir (subscriber, #11501) [Link]
s/r/z/ and you get the answer:
The original idea and initial implementation were by the same Tridge, but he had to stop developing it.
The guts of git
Posted Apr 13, 2005 4:36 UTC (Wed) by bronson (subscriber, #4806) [Link]
I love Linus's attitude. "Disk space? Heck, it's not a problem now. If it becomes a problem, I suppose we'll fix it then." (my paraphrase) I've worked on a number of projects, failed of course, that needed this sort of pragmatism.
The guts of git
Posted Apr 13, 2005 9:27 UTC (Wed) by ekj (guest, #1524) [Link]
But in the context of source code it's obviously true. Not necessarily so for all other contexts. Source code is *very* small and *very* compressible in relation to how much work it takes to produce it. If you invest a million dollars in developing some software over a year, *ALL* revisions of *ALL* files can still be stored, completely uncompressed, for a storage cost in the pennies range.
There isn't anyone seriously working on kernel development who has a problem storing 10 or 100GB in order to do so efficiently. And hard disks are less than a dollar/GB.
The guts of git
Posted Apr 14, 2005 21:02 UTC (Thu) by joey (guest, #328) [Link]
hmm, I've done some calculations before on checking out all revisions of all data I keep in my subversion repositories. IIRC, checking out all versions of all files in my ~3 gb of repositories would need closer to 1 terabyte of data than 100 gb. Not very practical for laptop use. :-)
The guts of git
Posted Apr 15, 2005 7:49 UTC (Fri) by njhurst (guest, #6022) [Link]
Have you considered for loops?
The guts of git
Posted Apr 15, 2005 21:45 UTC (Fri) by proski (subscriber, #104) [Link]
You cannot just multiply the number of revisions by the number of files unless you change all files in every revision. The files that don't change between revisions are not stored as separate copies (because their SHA1 checksum is the same). In fact, if you revert to original file contents, the repository would be reusing the old files.
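The deduplication described here falls straight out of content addressing; a sketch of the idea:

```python
import hashlib

store = {}

def add(data: bytes) -> str:
    """Store an object under the SHA of its contents. Identical
    contents land on the same key, so nothing is stored twice."""
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name

v1 = add(b"original contents\n")
v2 = add(b"edited contents\n")
v3 = add(b"original contents\n")   # revert: reuses the old object
assert v1 == v3                    # same contents, same name
assert len(store) == 2             # only two distinct objects stored
```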
git Mailing List
Posted Apr 13, 2005 7:21 UTC (Wed) by PaulDickson (guest, #478) [Link]
I totally missed the announcement about the git mailing list. I only saw a single reference to it. Here's a link:
http://vger.kernel.org/vger-lists.html#git
Howto checkin git hook
Posted Aug 17, 2009 17:13 UTC (Mon) by dajoe13 (guest, #60297) [Link]
I am trying to add a post-checkout hook. I also noticed that the post-checkout sample does not exist when I init a new archive. I am running git version 1.6.0.2.
Regards,
Joe
Howto checkin git hook
Posted Aug 17, 2009 20:22 UTC (Mon) by nix (subscriber, #2304) [Link]
dangerous, if you think about it, because hooks are executable code!
You have to use some other mechanism (ssh, whatever) to install it at the
server end.
Linus has guts
Posted Apr 14, 2005 9:58 UTC (Thu) by Tobu (subscriber, #24111) [Link]
What a sucker, this Linus guy.
I mean: for years, he's been ignoring the free SCMs, on the grounds that they "didn't scale" for such a large project as the kernel. When he finally is bitten, he still scorns the alternatives, and comes up with a downgraded monotone sketch.
Being something of a rock star, he'll get away with it, and will get enough developer support because people must still be able to talk to his tree; but in terms of development time, he's just throwing others' work out of the window.
Linus has guts
Posted Apr 14, 2005 11:34 UTC (Thu) by filipjoelsson (guest, #2622) [Link]
Hmm, if it takes Linus two hours to merge a patch-bomb, I think he has put less time into developing this new foundation for an SCM - than he would have been spending waiting for one of the existing SCMs merging his patches over the next month or two.
Others are spending time inventing fancy algos, theories, and such, but fail on the real-world test. Why should Linus not scratch this desperate itch then? Why should he come to an existing project, and overturn it with his rather extreme demands? Does he have the time to wait for one of them to "grow up"? I think not. Remember that he is working full time, whereas many contributors to smaller projects are programming in their spare time. Thus, waiting on the spare-time programmers would be rather frustrating.
So, it's not a question about Linus being a rock star. It's a question of efficiency.
Linus has guts
Posted Apr 14, 2005 18:44 UTC (Thu) by dlang (guest, #313) [Link]
the other projects have had three years to develop a better SCM solution.
as Linus pointed out in the announcement that he was going to look at other options for SCM, they aren't useable yet for something the size of the kernel.
we'll see if git ends up being a dead-end system or the base for a better system in the future, but it has a very different focus than the other SCM options.
every one of the other options starts with what the UI is, what commands the scm will implement. git is starting from the other direction, building the low-level pieces (and building them to be fast) and then layering the UI on top of that.
this seems like a much better approach to take, and the other SCM systems can take advantage of these primitives as well (assuming they are willing to live with a linux-only limitation or the slowdown on other systems that don't have some of the underlying infrastructure that git is counting on in the OS for its speed)
Linus has guts
Posted Apr 14, 2005 14:38 UTC (Thu) by kevinbsmith (guest, #4778) [Link]
At first, I thought Linus was wasting his time (and that of others) on git. It is still possible that he is, but if he really has found a super-simple alternative to the complexity of other distributed SCM systems, then he has really achieved something remarkable. Sometimes it takes an outsider to see what the experts overlook.
If (almost) anyone else had pitched this idea to him, I doubt Linus would have paid any attention. If (almost) anyone else had come up with git, they would not have immediately attracted a community of developers to help build it. Linus also has the advantage of targeting a single workflow process within a single project. That's far easier than creating a generic tool.
If someone can wrap a true, minimal but functional generic SCM around this thing, I will try it. I like minimalist simplicity. We should know within a few weeks whether the fundamental ideas are a good foundation, or have dead-end limitations. At the moment, I'm betting it will work.
One other thing: I wish the authors of the various free distributed SCM's could gather to critique git's future possibilities. Assuming they could set aside their own biases, they are in the best position to point out limitations and guide the design.
Linus has guts
Posted Mar 17, 2011 14:58 UTC (Thu) by nix (subscriber, #2304) [Link]
I think we can say that git is by now more than a 'downgraded monotone sketch' :)
The guts of git
Posted Apr 14, 2005 12:19 UTC (Thu) by lyda (guest, #7429) [Link]
i can't get git 0.04 to compile on a redhat 7.3 box.
update-cache.c: In function `index_fd':
update-cache.c:26: warning: implicit declaration of function `close'
update-cache.c: In function `fill_stat_cache_info':
update-cache.c:68: structure has no member named `st_ctim'
update-cache.c:70: structure has no member named `st_mtim'
update-cache.c: In function `match_data':
update-cache.c:116: warning: implicit declaration of function `read'
update-cache.c: In function `main':
update-cache.c:282: warning: implicit declaration of function `unlink'
make: *** [update-cache.o] Error 1
The guts of git
Posted Apr 14, 2005 19:49 UTC (Thu) by proski (subscriber, #104) [Link]
I guess you need to define _GNU_SOURCE for the st_* fields and include unistd.h for the missing declarations. Userspace programming with all those compatibility issues is so much harder than the kernel, isn't it? :-)