Gits Guts: Part II

In the first part of this article series we looked into the objects that make up Git’s datastore (blobs, trees and commits). We also saw how commits participate in a directed acyclic graph (or DAG).

In this second and final part of the series, we examine a few Git commands and see how they work with and manipulate the DAG.

I wrote this article for the October 2014 issue of NFJS, The Magazine. It is the second of a two-part series. You can find the first part here

Initial set up

If you have been playing along since we last met then you can continue to use the gitsGuts repository we set up. Otherwise let us quickly initialize a new repository with a few commits so we have a basis to work with.

Setting up a play repository

 $ git init gitsGuts (1)
 # Initialized empty Git repository in /Users/looselytyped/Documents/articles/gitsGuts/.git/
 $ cd gitsGuts (2)
 $ (master) echo 'Hello Git!' > README.md (3)
 $ (master) mkdir src
 $ (master) echo '// This is my source code' > src/Main.java (4)
 $ (master) git add .
 $ (master) git commit -m "Initial commit" (5)
 # [master (root-commit) 3cf00f8] Initial commit
 # 2 files changed, 2 insertions(+)
 # create mode 100644 README.md
 # create mode 100644 src/Main.java
 $ (master) echo 'Making another commit' >> README.md (6)
 $ (master) git add README.md
 $ (master) git commit -m "Second commit" (7)
 # [master aed7e05] Second commit
 # 1 file changed, 1 insertion(+)
....
<1> Initialize a new repository
<2> Be sure to cd into it!
<3> Initialize a README file with some text
<4> Initialize another plain text file inside the src sub-directory
<5> git-add and git-commit both the files
<6> Edit the README file by appending some text
<7> Make a second commit

We now have a Git repository with 2 files and 2 commits. Just to be sure we are on the same page, let us inspect the directory structure using the tree command. I also display an abbreviated version of the Git log. ^[1]

 $ (master) tree (1)
 # .
 # ├── README.md
 # └── src
 #    └── Main.java

 # 1 directory, 2 files
 $ (master) git lg (2)
 # * aed7e05        -  (HEAD, master) Second commit <Raju Gandhi>
 # * 3cf00f8        -  Initial commit <Raju Gandhi>
....
<1> Display the structure of the repository
<2> An abbreviated Git log

Bear in mind that the hashes of your commits will be different from those you see in my log. The latest commit in my repository happens to be aed7e05, so be sure to remember yours.

Looking good? Then let us talk about branching.

`git-branch`

If you have used Git for any length of time, you are almost certainly used to branching and are most likely an ardent proponent of it. You have probably also heard that branching in Git is really cheap. So how does branching in Git really work?

One way to think about branches in Git is to think of them as sticky notes. You can visualize each sticky note as having two lines of text: the first line contains the name of the branch, written with a thick permanent marker. The second line is written in pencil and is the hash of the last commit on that branch.

Let us start by examining the .git/refs/heads directory, and then create a new branch using git-branch, and inspect that directory once again.

 $ (master) tree .git/refs/heads (1)
 # .git/refs/heads
 # └── master
 #
 # 0 directories, 1 file
 $ (master) cat .git/refs/heads/master (2)
 # aed7e05f8b3fc115c1c2507c79454c002383e9ee
 $ (master) git branch featureBranch (3)
 $ (master) tree .git/refs/heads (4)
 # .git/refs/heads
 # ├── featureBranch
 # └── master
 #
 # 0 directories, 2 files
 $ (master) cat .git/refs/heads/featureBranch (5)
 # aed7e05f8b3fc115c1c2507c79454c002383e9ee
....
<1> List the files under the .git/refs/heads
<2> Inspect the contents of the master file
<3> Create a new branch using git-branch
<4> List the files under .git/refs/heads again to see a new file
<5> Display the contents of the newly created file

Recall that by default Git creates a master branch for our repository. Listing the files under the .git/refs/heads directory reveals a file with exactly that name. Furthermore, .git/refs/heads/master happens to be a plain text file that contains exactly one line of text — which is the hash of the latest commit on the master branch.

We then create a new branch using the git-branch command supplying it with the name of the branch. Inspecting .git/refs/heads once again reveals that a new file now resides beneath it — and the name of the file just so happens to be the name of the newly created branch. Inspecting the contents of .git/refs/heads/featureBranch tells us that it too contains the same hash as the master file — or in other words the hash of the latest commit on that branch.

In this illustration I have added how branches play into the DAG. You will notice that this is a slightly different version of the illustration that we saw in Part I of this series — in that I have stripped out the trees that the commits point to, and correspondingly sub-trees and blobs. This will allow us to focus on the DAG.

Figure 1. Git Branches

As you can see, Git branches are simply pointers, or references, that point to commit objects using their hashes. Each branch has two parts to it: the name of the branch, and the commit it points to.
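The pointer idea can be checked with a couple of plumbing commands. What follows is a minimal sketch, assuming a throwaway repository created with mktemp (not the gitsGuts repository above), a placeholder author identity, and Git's default file-based ref storage, where a freshly created branch lives in a loose file under .git/refs/heads. It compares what git rev-parse reports for each branch with the raw contents of the ref file.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false            # keep the sketch independent of signing setups
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"

git branch featureBranch                   # a new sticky note on the same commit

master_hash="$(git rev-parse master)"                  # ask Git where master points
feature_hash="$(git rev-parse featureBranch)"          # ask Git where featureBranch points
ref_file_hash="$(cat .git/refs/heads/featureBranch)"   # read the sticky note directly

echo "master:        $master_hash"
echo "featureBranch: $feature_hash"
echo "ref file:      $ref_file_hash"
```

All three lines should show the same hash. If refs have been packed (for example after git gc), the loose file may be absent and the hash recorded in .git/packed-refs instead, so git rev-parse is the more dependable way to ask where a branch points.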

Now, what did I mean earlier when I spoke of permanent markers and pencils and sticky notes? It turns out that we can answer that question simply by making another commit. Let us do that, shall we?

 $ (master) git status (1)
 # On branch master
 # nothing to commit, working directory clean
 $ (master) echo 'Making a third commit on master' >> README.md (2)
 $ (master) git add README.md
 $ (master) git commit -m "Third commit" (3)
 # [master a509575] Third commit
 # 1 file changed, 1 insertion(+)
 $ (master) git lg (4)
 # * a509575        -  (HEAD, master) Third commit <Raju Gandhi>
 # * aed7e05        -  (featureBranch) Second commit <Raju Gandhi>
 # * 3cf00f8        -  Initial commit <Raju Gandhi>
....
<1> Git status tells us we are on the master branch and the working directory is clean
<2> Make an edit
<3> Commit the edit
<4> Display the abbreviated git log

The abbreviated Git log tells us that the master branch is one commit ahead of featureBranch. If you recall our discussion from Part I of this series, you know that when we made our latest commit Git created a new commit object. This commit has a calculated hash of a509575 (or a509575203205931cbcfc5a21d11c395ffbdced4 to be precise) and has a pointer to its parent commit, which happens to be aed7e05. Git also took the sticky note with master on it, erased the hash that was previously written on it and replaced it with a509575203205931cbcfc5a21d11c395ffbdced4. You can verify this by simply cat-ing .git/refs/heads/master and .git/refs/heads/featureBranch:

 $ (master) cat .git/refs/heads/master
 # a509575203205931cbcfc5a21d11c395ffbdced4
 $ (master) cat .git/refs/heads/featureBranch
 # aed7e05f8b3fc115c1c2507c79454c002383e9ee

You can visualize the net effect in the following illustration.

Figure 2. The DAG after a commit on master

It really is that simple! Git simply adds to the DAG just as we expected it to, and updates the appropriate references (written in pencil). The name of the branch needs no updating, hence in our analogy the name can be seen as written with a permanent marker. ^[2]

You should also note that the commit has no knowledge of any of the references that point to it — that information is maintained outside the DAG.
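One way to see that a commit knows nothing about the branches pointing at it is to print the raw commit object. This is a sketch under assumptions: a throwaway repository, a placeholder identity, and commit messages that mirror the ones above. It uses git cat-file -p, which prints the stored object as Git sees it.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"

git branch featureBranch                   # stays behind on the first commit
echo 'Another line' >> README.md
git add README.md
git commit -q -m "Second commit"           # only master moves forward

feature_hash="$(git rev-parse featureBranch)"
commit_body="$(git cat-file -p master)"    # raw contents of the newest commit object
echo "$commit_body"
```

The output lists a tree, a parent (the commit featureBranch still points to), the author, the committer and the message. No branch name appears anywhere in it.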

Quiz time: can you visualize what would happen if I checked out featureBranch and made a commit on that branch? Git creates a new commit with aed7e05f8b3fc115c1c2507c79454c002383e9ee as the parent, then updates the featureBranch sticky note with the hash of the latest commit on that branch. Take a look.

Figure 3. The DAG after a commit on featureBranch

You see how the code diverges away from master.

What if we were to delete a branch, say master using git branch -D master? ^[3] Git simply takes the sticky note with master on it, crumples it and throws it away! On inspecting the .git/refs/heads directory you will see that the master file has indeed been deleted.

You might wonder about the commit that master was referencing prior to being deleted. In our particular scenario you can see that once the master sticky note disappears there is nothing referencing the latest commit on that branch. Git will eventually ^[4] throw that commit away as well. Note that every other commit object in the DAG still has something referring to it, be that a sticky note or a child commit that treats it as its parent. As long as a commit object has a hard reference to it, Git will keep it around; otherwise it will be garbage collected.
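Since garbage collection does not happen the moment a branch is deleted, the commit can be inspected, and the sticky note rewritten, for as long as you still know the hash. The sketch below assumes a throwaway repository and a placeholder identity; the variable name orphan is purely illustrative.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"
git branch featureBranch
echo 'Only on master' >> README.md
git add README.md
git commit -q -m "Commit only master can reach"

orphan="$(git rev-parse master)"           # remember the hash before throwing the note away
git checkout -q featureBranch              # Git refuses to delete the checked-out branch
git branch -q -D master                    # crumple the sticky note

test ! -e .git/refs/heads/master && echo "master ref file is gone"
git cat-file -t "$orphan"                  # prints: commit (the object is still in the datastore)

git branch master "$orphan"                # write a fresh sticky note for the same commit
git rev-parse master
```

The HEAD reflog typically also remembers such a commit for a while, which is part of why "eventually" can be a long time in practice.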

Figure 4. The DAG after deleting master

In this section we saw how git-branch affects the DAG, and how operations like git-commit and deleting branches affect the DAG.

One thing you might have been wondering about all along is — how does Git know which branch to work on? Let us take a look, shall we?

`git-checkout`

Whenever we wish to work on a particular branch in Git we have to check it out. What does this mean in terms of the DAG, and is there more to it than meets the eye?

Our leading character for this section is the HEAD file that resides directly beneath the .git directory. Let us start by inspecting the HEAD file, then check out (or switch) branches and see what happens. (Please note that if you have been following along on the terminal you should have featureBranch checked out, and since we deleted master we will need to recreate it just so we can switch to it.)

 $ (featureBranch) git branch master (1)
 $ (featureBranch) cat .git/HEAD (2)
 # ref: refs/heads/featureBranch
 $ (featureBranch) git checkout master (3)
 # Switched to branch 'master'
 $ (master) cat .git/HEAD (4)
 # ref: refs/heads/master
....
<1> Recreate master
<2> List the contents of .git/HEAD
<3> Switch branches
<4> Inspect .git/HEAD again

First things first — the .git/HEAD file tells Git what the HEAD currently points to. Furthermore, it turns out that the HEAD file, unlike the refs files does not seem to contain a hash. Rather, it seems to point to a reference!

Another way to think about this is that the HEAD is a symbolic reference, in that it does not directly point to a hash, rather it points to the reference that represents the currently checked out commit.
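Rather than reading .git/HEAD by hand, you can ask Git for both levels of indirection. This is a small sketch, assuming a throwaway repository and a placeholder identity: git symbolic-ref HEAD reports the reference HEAD names, and git rev-parse HEAD follows it all the way to a commit hash.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"
git branch featureBranch

git symbolic-ref HEAD                      # prints: refs/heads/master
git checkout -q featureBranch
git symbolic-ref HEAD                      # prints: refs/heads/featureBranch

head_ref="$(git symbolic-ref HEAD)"        # first hop: HEAD -> reference
head_commit="$(git rev-parse HEAD)"        # second hop: reference -> commit hash
echo "HEAD -> $head_ref -> $head_commit"
```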

You can visualize how the HEAD works as shown here (I have truncated the diagram for brevity)

Figure 5. featureBranch is checked out

After checking out master this is how the DAG would look

Figure 6. master is checked out

As you can see, whatever HEAD points to represents what is “checked” out. But there is more to that than meets the eye.

The most important thing to bear in mind about the HEAD is that the HEAD will always represent the parent of the next commit. There is no exception to this rule.

Knowing this, can you see how making a commit now will work? Git will kick off all the machinery that is needed to calculate the hashes of the blobs, trees, and finally the commit. It will use the commit that HEAD points to, and make that commit the parent of the next commit. Now that the commit is a member of the DAG, Git will simply rewrite the master sticky note with the hash of the new commit.

Does the HEAD need updating? No! It continues to point to the master reference.
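Both claims can be checked from the terminal: the commit HEAD resolved to beforehand becomes the parent of the new commit, and .git/HEAD itself is left alone. The following is a sketch that assumes a throwaway repository and a placeholder identity; the variable names are illustrative only.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"

commit_before="$(git rev-parse HEAD)"      # what HEAD resolves to before committing
head_file_before="$(cat .git/HEAD)"        # the symbolic reference itself

echo 'More text' >> README.md
git add README.md
git commit -q -m "Second commit"

parent_of_new="$(git rev-parse 'HEAD^')"   # parent of the commit we just made
head_file_after="$(cat .git/HEAD)"

echo "old HEAD commit:   $commit_before"
echo "new commit parent: $parent_of_new"
echo ".git/HEAD before:  $head_file_before"
echo ".git/HEAD after:   $head_file_after"
```

The first two lines should match each other, as should the last two. Only the master sticky note was rewritten.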

Knowing that the HEAD will always be the parent of the next commit has a few implications. If you have ever committed on the wrong branch then it was because you lost track of your HEAD (pun intended). Liberal use of git-status is a good way to avoid the aforementioned problem. An alternative is to combine the use of git-prompt.sh along with some bash prompt trickery to always have the branch you have checked out visible when working at the terminal.

There is yet another powerful, and often nerve-racking (especially for newcomers to Git) facet to the HEAD. For a minute let us consider what happens when we git-checkout a branch. Git looks in the .git/refs/heads directory to find the file that matches the name of the branch we wish to check out and identifies the hash that that branch currently points to. It then looks in the .git/objects directory and finds the commit object that the hash represents and “unfolds” it: it finds the tree the commit points to and recreates the working directory as represented by that tree object. Finally, it rewrites the .git/HEAD file to symbolically point to the newly checked out branch.

If you were to boil down the git-checkout lookup algorithm to its essence you could think of Git as checking out a hash!
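The lookup can be retraced by hand with plumbing commands. This is only a sketch of the resolution steps under assumptions (a throwaway repository, a placeholder identity, illustrative variable names); it does not touch the working directory, and it is not a description of how git-checkout is implemented internally.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
mkdir src
echo '// This is my source code' > src/Main.java
git add .
git commit -q -m "Initial commit"

branch=master
commit="$(git rev-parse --verify "refs/heads/$branch")"   # branch name -> commit hash
tree="$(git rev-parse --verify "$commit^{tree}")"         # commit -> the tree it points to

git cat-file -t "$commit"                  # prints: commit
git cat-file -t "$tree"                    # prints: tree
files="$(git ls-tree -r --name-only "$tree")"   # the paths that tree would recreate
echo "$files"
```

Every step after the first works purely on hashes, which is why the branch name can be thought of as a convenient way of arriving at one.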

We are programmers, and now we are curious — what if we were to checkout a hash? What happens? Let us find out, shall we?

Checking out a particular hash

 $ (master) git lg (1)
 # * 40ee28b        -  (HEAD, master, featureBranch) Some commit <Raju Gandhi>
 # * aed7e05        -  Second commit <Raju Gandhi>
 # * 3cf00f8        -  Initial commit <Raju Gandhi>
 $ (master) git checkout aed7e05 (2)
 # Note: checking out 'aed7e05'.
 #
 # You are in 'detached HEAD' state. You can look around, make experimental
 # changes and commit them, and you can discard any commits you make in this
 # state without impacting any branches by performing another checkout.
 #
 # If you want to create a new branch to retain commits you create, you may
 # do so (now or later) by using -b with the checkout command again. Example:
 #
 #   git checkout -b new_branch_name
 #
 # HEAD is now at aed7e05... Second commit
 ....
<1> Abbreviated git log
<2> Pick the second commit and check it out

We start by looking at the log (just so we can pick a commit hash at random) and then proceed to check it out. Git informs us that we are in detached HEAD state — we will see what that means in a minute.

Before we proceed I want you to read the warning that Git emitted when we checked out aed7e05. Done? Moving on then …

First things first, what does HEAD point to? That one is easy — we can simply cat .git/HEAD

HEAD in detached HEAD state

 $ ((aed7e05...)) cat .git/HEAD
 # aed7e05f8b3fc115c1c2507c79454c002383e9ee

Aha! Now .git/HEAD points directly to a hash instead of symbolically pointing to one via a reference. Let us attempt to visualize how this looks.

Figure 7. Detached HEAD state

As you can see HEAD now points to a commit directly.
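If you would rather ask Git than read the file, git symbolic-ref -q HEAD exits with a non-zero status when HEAD is not a symbolic reference, which makes it usable as a detached-HEAD test. A minimal sketch, assuming a throwaway repository and a placeholder identity; the state variable is illustrative.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"
echo 'Making another commit' >> README.md
git add README.md
git commit -q -m "Second commit"

first="$(git rev-parse 'HEAD^')"           # hash of the initial commit
git checkout -q "$first"                   # check out a hash instead of a branch

cat .git/HEAD                              # now a bare hash, with no "ref:" prefix

if git symbolic-ref -q HEAD >/dev/null; then
  state=attached                           # HEAD names a branch
else
  state=detached                           # HEAD holds a commit hash directly
fi
echo "HEAD is $state"
```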

Knowing this, and that HEAD will always point to the parent of the next commit, can you visualize what would happen if we were to make a commit at this point? Let us quickly make a commit, and then lay out the DAG so we can conceptualize how the DAG changed.

 $ ((aed7e05...)) echo 'In Detached HEAD state' >> README.md (1)
 $ ((aed7e05...)) git add README.md (2)
 $ ((aed7e05...)) git commit -m "Making a commit in detached HEAD state" (3)
 # [detached HEAD ff21829] Making a commit in detached HEAD state
 #  1 file changed, 1 insertion(+)
 $ ((ff21829...)) git lg (4)
 # * ff21829        -  (HEAD) Making a commit in detached HEAD state <Raju Gandhi>
 # | * 40ee28b      -  (master, featureBranch) Some commit <Raju Gandhi>
 # |/
 # * aed7e05        -  Second commit <Raju Gandhi>
 # * 3cf00f8        -  Initial commit <Raju Gandhi>
 ....
 <1> Make an edit
 <2> Add the file to the index
 <3> Make a commit
 <4> Git log

We see we have a new commit (ff21829) that HEAD now points to. Any ideas on the DAG?

Figure 8. Detached HEAD state

We are one step away from truly understanding what the “detached” in detached HEAD means. Answer this question: what happens if we were to git-checkout master or featureBranch (or, for that matter, any other commit)? If we check out another commit then the HEAD would directly or indirectly point to that commit, and leave ff21829 behind! Who points to ff21829 then? No one! Which means that when Git’s garbage collector comes around (and it will) our newly created commit will disappear.

Another way to think about detached HEAD state is to think of it as being on an anonymous branch. See, when we have the HEAD pointing directly to a commit, Git continues to behave as if we were working with a “named” branch, except when we check something else out. At that point, if we are not careful, the commit that HEAD was pointing to may be left with nothing else pointing to it. And we know what happens to commits that have no hard references to them, yes?

What are the chances that we will leave a commit behind? Let us check out master right now and see what happens.

Leaving a commit behind

 $ ((ff21829...)) git checkout master
 # Warning: you are leaving 1 commit behind, not connected to
 # any of your branches:
 #
 #   ff21829 Making a commit in detached HEAD state
 #
 # If you want to keep them by creating a new branch, this may be a good time
 # to do so with:
 #
 #  git branch new_branch_name ff21829
 #
 # Switched to branch 'master'

Git ever so nicely warns us that we are indeed leaving ff21829 behind, and if we do wish to keep it around it may serve us well to create a new branch. It even tells us how to go about doing it. In essence Git is telling us to create a sticky note as a reminder of the commit’s hash!
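The whole round trip (detach, commit, leave, rescue) fits in a short script. This is a sketch under assumptions: a throwaway repository, a placeholder identity, and a hypothetical branch name, rescued, standing in for new_branch_name. The -q flags keep the script's output quiet.

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory (assumption)
git init -q .
git symbolic-ref HEAD refs/heads/master    # name the first branch master regardless of local defaults
git config user.name "Example Author"      # placeholder identity
git config user.email "author@example.com"
git config commit.gpgsign false
echo 'Hello Git!' > README.md
git add README.md
git commit -q -m "Initial commit"

git checkout -q --detach                   # detach HEAD at the current commit
echo 'In Detached HEAD state' >> README.md
git add README.md
git commit -q -m "Making a commit in detached HEAD state"
stranded="$(git rev-parse HEAD)"           # note the hash while HEAD still points at it

git checkout -q master                     # walk away: no branch points at $stranded now
git branch rescued "$stranded"             # write a sticky note for it (hypothetical name)

git branch --contains "$stranded"          # lists rescued
```

If you forgot to note the hash, git reflog usually still shows where HEAD has recently been, so the commit can often be found again before the garbage collector gets to it.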

Keeping track of the HEAD when working in Git is essential since it dictates where our changes will eventually end up in the DAG. However, Git allowing us to move the HEAD to any arbitrary commit allows us to be playful — we can checkout any other state of our repository for quick and dirty experimentation or debugging. If we like what we see we can simply create a new branch and keep our changes around a little bit longer, or simply checkout some other commit and be on our merry way knowing that Git’s garbage collector will come around and clean up our mess for us.

Conclusion

Reiterating what I said about Git at the end of Part I — Git’s power comes from simplicity. The DAG represents the fundamental datastructure that Git uses to store our repository’s history — and all commands that we love and use in Git affect that DAG. We now understand how the DAG is built, and we understand how a few commands operate on that DAG.

Take a look at any of Git’s man pages, for git-merge, git-rebase or what-have-you, and you will see references to the DAG everywhere.

Where do we go from here? I suggest that the next time you start to work with Git you keep a picture of the DAG in your mind’s eye. Whenever you are about to issue a command to Git, attempt to visualize what the DAG will look like after the command executes, then attempt to find out ^[5] if you got it right.

Till next time, May the DAG be with you.

1. You can get a similar log output using the command below. I have the same aliased to git lg.

git log --graph --all --full-history --color --pretty=format:'%x1b[31m%h%x09%x1b[32m %C(white)- %d%x1b[0m%x20%s %C(bold blue)<%an>%Creset'

2. Git allows for the renaming of branches but for now we can ignore that, and thus allow our analogy to live just that little longer

3. You will need to supply the -D (uppercase) flag in this case since Git will complain of master not being fully merged

4. Git has a garbage collection process for cleaning up dereferenced objects in the data-store, but that is a discussion for another day. For now knowing that there is some cleanup that Git does regularly suffices for our discussion.

5. Tools like Atlassian’s SourceTree or gitk can prove to be handy here
