Git Theory - 1 - Origins story and terms

03.10.2020 — notes, git — 4 min read

This is a series of posts on Git, mostly theoretical with a few practical examples. It all started a couple of years back (2018): at work I planned to give a presentation on Git and started preparing a PPT for it, but it never happened because of other work, issue handling and other activities. Still, I think I did a good job of gathering information on Git, which ended up sleeping in a PPT that I actually refer to often. After getting my site up, I had it on my todo list to move the Git PPT material to the web, where it is much easier to refer to.

Here I have split the content into multiple posts, as below:

Origin : How it all began, what Git is, and the terminology used in this series.
Basics : config, init, add, rm, .gitignore, commit, log, blame, diff, tag, describe, show and stash
Undos : checkout, reset, revert and restore
Branching : Git Branching
Internals : Git Internals
Collaboration : Git remote repository
Git Everyday : Git flowchart, shortcuts and references

# Origin story

In simple words,

In 2005, Linus Torvalds urgently needed a new version control system to maintain the development of the Linux Kernel. So he went offline for a week, wrote a revolutionary new system from scratch, and called it Git.

Here is the announcement, subject:Kernel SCM saga.. one, two.

If you have read the above mail, it's clear that he liked BitKeeper but was frustrated that Linux could no longer use it, and that he was unimpressed by the competition. The outcome was Git.

'Git' is a made-up name. In British slang, it means 'stupid person'. There is also a made-up expansion for git, 'Global Information Tracker', but that's really a 'backronym'.

Here are a few words from the author himself, Linus Torvalds, from the README at commit e83c516 (Apr 8, 2005):

GIT - the stupid content tracker

"git" can mean anything, depending on your mood.

 - random three-letter combination that is pronounceable, and not
   actually used by any common UNIX command.  The fact that it is a
   mispronounciation of "get" may or may not be relevant.
 - stupid. contemptible and despicable. simple. Take your pick from the
   dictionary of slang.
 - "global information tracker": you're in a good mood, and it actually
   works for you. Angels sing, and a light suddenly fills the room.
 - "goddamn idiotic truckload of sh*t": when it breaks

This is a stupid (but extremely fast) directory content manager.  It
doesn't do a whole lot, but what it _does_ do is track directory
contents efficiently.

There are two object abstractions: the "object database", and the
"current directory cache".

    The Object Database (SHA1_FILE_DIRECTORY)

The object database is literally just a content-addressable collection
of objects.  All objects are named by their content, which is
approximated by the SHA1 hash of the object itself.  Objects may refer
to other objects (by referencing their SHA1 hash), and so you can build
up a hierarchy of objects.

There are several kinds of objects in the content-addressable collection
database.  They are all in deflated with zlib, and start off with a tag
of their type, and size information about the data.  The SHA1 hash is
always the hash of the _compressed_ object, not the original one.

In particular, the consistency of an object can always be tested
independently of the contents or the type of the object: all objects can
be validated by verifying that (a) their hashes match the content of the
file and (b) the object successfully inflates to a stream of bytes that
forms a sequence of <ascii tag without space> + <space> + <ascii decimal
size> + <byte\0> + <binary object data>.

BLOB: A "blob" object is nothing but a binary blob of data, and doesn't
refer to anything else.  There is no signature or any other verification
of the data, so while the object is consistent (it _is_ indexed by its
sha1 hash, so the data itself is certainly correct), it has absolutely
no other attributes.  No name associations, no permissions.  It is
purely a blob of data (ie normally "file contents").

TREE: The next hierarchical object type is the "tree" object.  A tree
object is a list of permission/name/blob data, sorted by name.  In other
words the tree object is uniquely determined by the set contents, and so
two separate but identical trees will always share the exact same
object.

Again, a "tree" object is just a pure data abstraction: it has no
history, no signatures, no verification of validity, except that the
contents are again protected by the hash itself.  So you can trust the
contents of a tree, the same way you can trust the contents of a blob,
but you don't know where those contents _came_ from.

Side note on trees: since a "tree" object is a sorted list of
"filename+content", you can create a diff between two trees without
actually having to unpack two trees.  Just ignore all common parts, and
your diff will look right.  In other words, you can effectively (and
efficiently) tell the difference between any two random trees by O(n)
where "n" is the size of the difference, rather than the size of the
tree.

Side note 2 on trees: since the name of a "blob" depends entirely and
exclusively on its contents (ie there are no names or permissions
involved), you can see trivial renames or permission changes by noticing
that the blob stayed the same.  However, renames with data changes need
a smarter "diff" implementation.

CHANGESET: The "changeset" object is an object that introduces the
notion of history into the picture.  In contrast to the other objects,
it doesn't just describe the physical state of a tree, it describes how
we got there, and why.

A "changeset" is defined by the tree-object that it results in, the
parent changesets (zero, one or more) that led up to that point, and a
comment on what happened. Again, a changeset is not trusted per se:
the contents are well-defined and "safe" due to the cryptographically
strong signatures at all levels, but there is no reason to believe that
the tree is "good" or that the merge information makes sense. The
parents do not have to actually have any relationship with the result,
for example.

Note on changesets: unlike real SCM's, changesets do not contain rename
information or file mode chane information.  All of that is implicit in
the trees involved (the result tree, and the result trees of the
parents), and describing that makes no sense in this idiotic file
manager.

TRUST: The notion of "trust" is really outside the scope of "git", but
it's worth noting a few things. First off, since everything is hashed
with SHA1, you _can_ trust that an object is intact and has not been
messed with by external sources. So the name of an object uniquely
identifies a known state - just not a state that you may want to trust.

Furthermore, since the SHA1 signature of a changeset refers to the
SHA1 signatures of the tree it is associated with and the signatures
of the parent, a single named changeset specifies uniquely a whole
set of history, with full contents. You can't later fake any step of
the way once you have the name of a changeset.

So to introduce some real trust in the system, the only thing you need
to do is to digitally sign just _one_ special note, which includes the
name of a top-level changeset.  Your digital signature shows others that
you trust that changeset, and the immutability of the history of
changesets tells others that they can trust the whole history.

In other words, you can easily validate a whole archive by just sending
out a single email that tells the people the name (SHA1 hash) of the top
changeset, and digitally sign that email using something like GPG/PGP.

In particular, you can also have a separate archive of "trust points" or
tags, which document your (and other peoples) trust.  You may, of
course, archive these "certificates of trust" using "git" itself, but
it's not something "git" does for you.

Another way of saying the same thing: "git" itself only handles content
integrity, the trust has to come from outside.

    Current Directory Cache (".dircache/index")

The "current directory cache" is a simple binary file, which contains an
efficient representation of a virtual directory content at some random
time.  It does so by a simple array that associates a set of names,
dates, permissions and content (aka "blob") objects together.  The cache
is always kept ordered by name, and names are unique at any point in
time, but the cache has no long-term meaning, and can be partially
updated at any time.

In particular, the "current directory cache" certainly does not need to
be consistent with the current directory contents, but it has two very
important attributes:

 (a) it can re-generate the full state it caches (not just the directory
     structure: through the "blob" object it can regenerate the data too)

     As a special case, there is a clear and unambiguous one-way mapping
     from a current directory cache to a "tree object", which can be
     efficiently created from just the current directory cache without
     actually looking at any other data.  So a directory cache at any
     one time uniquely specifies one and only one "tree" object (but
     has additional data to make it easy to match up that tree object
     with what has happened in the directory)


and

 (b) it has efficient methods for finding inconsistencies between that
     cached state ("tree object waiting to be instantiated") and the
     current state.

Those are the two ONLY things that the directory cache does.  It's a
cache, and the normal operation is to re-generate it completely from a
known tree object, or update/compare it with a live tree that is being
developed.  If you blow the directory cache away entirely, you haven't
lost any information as long as you have the name of the tree that it
described.

(But directory caches can also have real information in them: in
particular, they can have the representation of an intermediate tree
that has not yet been instantiated.  So they do have meaning and usage
outside of caching - in one sense you can think of the current directory
cache as being the "work in progress" towards a tree commit).
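
The three object types from the README can still be inspected in a present-day repository, where the "changeset" is called a commit. The following is only a minimal sketch, assuming `git` is installed; the temporary directory, the file name `greeting.txt` and the "Example" identity are made up for illustration.

```shell
set -eu

# Throwaway repository; path, file name and identity are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "first commit"

# Resolve the three objects reachable from the new commit.
commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')
blob_id=$(git rev-parse HEAD:greeting.txt)

# cat-file -t prints the type tag stored at the start of each object.
commit_type=$(git cat-file -t "$commit_id")
tree_type=$(git cat-file -t "$tree_id")
blob_type=$(git cat-file -t "$blob_id")
echo "$commit_type $tree_type $blob_type"

# cat-file -p pretty-prints an object: the tree lists mode, type, ID and name,
# and the commit shows the tree it results in plus author and message.
git cat-file -p "$tree_id"
git cat-file -p "$commit_id"
```

The blob carries no file name or permissions; those live in the tree entry, which matches the description in the README.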

# What is Git

- Git is object-store software that tracks file changes.
- All files are referred to as objects, and each object has a unique hash ID.
- All files stored in Git are compressed.
- Git records the history of the project as a graph of commits, in the form of a Directed Acyclic Graph (DAG).
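
To make the "unique hash ID" idea concrete, here is a small sketch that asks Git for the ID of some content and then rebuilds the same ID by hand. It assumes `git` and the coreutils `sha1sum` tool are available (on macOS, `shasum` would be the rough equivalent) and that the repository uses the default SHA-1 object format. Note that present-day Git hashes the uncompressed `blob <size>\0<data>` bytes and only then compresses the object for storage, unlike the very first README quoted above.

```shell
set -eu

content='hello world'

# Ask Git for the object ID it would give this content (nothing is stored yet).
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

# Rebuild the same ID by hand: SHA-1 over "blob <size>\0" followed by the data.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
manual_id=$( { printf 'blob %s\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d ' ' -f 1)

echo "git:    $git_id"
echo "manual: $manual_id"

# With -w the object is also written, zlib-compressed, under .git/objects,
# in a directory named after the first two characters of its ID.
repo=$(mktemp -d)
git init -q "$repo"
stored_id=$(printf '%s\n' "$content" | git -C "$repo" hash-object -w --stdin)
object_path="$repo/.git/objects/$(echo "$stored_id" | cut -c1-2)/$(echo "$stored_id" | cut -c3-)"
ls -l "$object_path"
```

Because the ID depends only on the content, two files with identical contents end up as one object.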

There is definitely a learning curve in getting to know Git's concepts, commands and ways of collaborating. It has been around for 15 years and has become an essential skill that every developer needs to know. The difficulty with Git is that there are so many commands and different ways to do things.

Git commands are categorized into two types.
- Porcelain commands - the high-level commands that most users use
- Plumbing commands - low-level commands, meant for scripts and expert users

Most of the time we will be working with porcelain commands, but it's good to know what plumbing commands are; they give a sense of depth and an idea of how Git works internally.
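
As a rough illustration of the difference, the sketch below creates one commit using only plumbing commands and a second one using the usual porcelain. It assumes `git` is installed; the temporary directory, file names and "Example" identity are hypothetical.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

echo "hello" > greeting.txt

# Plumbing: roughly the steps that `git add` + `git commit` perform, done by hand.
blob_id=$(git hash-object -w greeting.txt)                  # store the contents as a blob
git update-index --add --cacheinfo 100644 "$blob_id" greeting.txt  # record it in the index
tree_id=$(git write-tree)                                   # snapshot the index as a tree
commit_id=$(echo "built with plumbing" | git commit-tree "$tree_id")  # wrap the tree in a commit
git update-ref "$(git symbolic-ref HEAD)" "$commit_id"      # point the current branch at it

# Porcelain: the same kind of result in two commands.
echo "more" > notes.txt
git add notes.txt
git commit -q -m "built with porcelain"

git log --oneline
```

Both commits end up in the same history; the porcelain version simply hides the intermediate blob, tree and ref steps.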

# Concepts & Terminologies

Git is a distributed version control system, and this is what that looks like:
- Every contributor has a local copy, or 'clone', of the main repository.
- Users can update their local repositories with new code from the main repository by an operation called 'pull'.
- The main repository can be updated from a local repository by an operation called 'push'.
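
The clone, push and pull cycle above can be tried entirely on one machine. This is only a sketch under assumptions: `git` is installed, a local bare repository stands in for the main repository, and the names `main.git`, `alice`, `bob` and `app.txt` are invented for the example.

```shell
set -eu

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# A bare repository stands in for the shared main repository.
git init -q --bare main.git

# Alice clones it, commits locally, then pushes her branch to the main repository.
git clone -q main.git alice
cd alice
echo "v1" > app.txt
git add app.txt
git commit -q -m "add app"
git push -q origin HEAD
cd ..

# Bob clones and receives the full history so far.
git clone -q main.git bob

# Alice publishes another change.
cd alice
echo "v2" > app.txt
git commit -q -am "update app"
git push -q origin HEAD
cd ..

# Bob pulls to bring his local repository up to date.
cd bob
git pull -q
cat app.txt
cd ..
```

Each clone is a complete repository in its own right, so Alice can commit without any connection to the main repository and push later.
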

Git terminology / glossary:
- Working directory / Working tree
  - This is literally your working directory: it is where you change and test your code, and it contains all tracked and untracked files.
- Staging area / INDEX
  - Contains all the files that are ready for the next commit. This is the place where your files become objects. When your files reach this area, you can say 'those files are staged'. The index is stored in .git/index.
- Local Repository
  - It's a local standalone repository, meaning you are the only one contributing to it. All the objects are stored in a hidden folder called .git; that folder is what we call the repository or object store, and it contains a complete copy of the code that has been added, committed or stashed.
- HEAD
  - HEAD usually points to the tip of the current branch, meaning it points to the latest commit on that branch. When HEAD references an arbitrary commit directly instead of a branch, it is said to be in a detached state. Instead of typing HEAD you can write @.
- ORIG_HEAD
  - ORIG_HEAD is the previous state of HEAD, set by commands that have possibly dangerous behavior so that they are easy to revert. It is less useful now that Git has the reflog: HEAD@{1} is always the previous value of HEAD.
- Remote repository
  - The mainstream. It could be GitHub, Bitbucket or another cloud repository; basically it's a remote server that you use to store and share your code.
- Upstream
  - It is a remote repository you want to contribute to.
- Downstream
  - When you copy code via clone/checkout, your copy becomes downstream; after you make changes you usually want to send them back "upstream" so they make it into the mainstream.
- Clone
  - The process of copying a Git repository, along with its history, to the local machine is called cloning. Also, when you clone, Git assumes you are a user of the repository.
- Bare repository
  - A repository that doesn't need a working directory, meaning no development or local updates happen in it; it is typically used as a remote. In simple words, a Git repository without a working tree is called a bare repository. You can create such a repository with the --bare option of git init or git clone.
```
# create a bare repository
git init --bare
```
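
A few of the glossary entries (HEAD, the `@` shorthand, the detached state, `HEAD@{1}` and bare repositories) can be checked from the command line. The sketch below assumes `git` is installed; the temporary directories, `file.txt` and the "Example" identity are made up for illustration.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "first"
echo "two" > file.txt
git commit -q -am "second"

# Normally HEAD is a symbolic ref that names the current branch.
branch_ref=$(git symbolic-ref HEAD)
echo "HEAD -> $branch_ref"

# @ is shorthand for HEAD, so both resolve to the same commit.
head_id=$(git rev-parse HEAD)
at_id=$(git rev-parse @)

# Checking out a commit directly (here the first one) detaches HEAD.
git checkout -q --detach 'HEAD~1'
if git symbolic-ref -q HEAD >/dev/null; then state=attached; else state=detached; fi
echo "HEAD is $state at $(git rev-parse --short HEAD)"

# The reflog remembers where HEAD was before the checkout.
previous_id=$(git rev-parse 'HEAD@{1}')

# A bare repository reports that it has no working tree.
bare=$(mktemp -d)
git init -q --bare "$bare"
is_bare=$(git -C "$bare" rev-parse --is-bare-repository)
echo "bare: $is_bare"
```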

In Git, a file can be in any of the following states:

- staged: A file to be included in the next commit.
- tracked: A file that is committed or staged is tracked. If a tracked file is modified but not staged, it can be referred to as dirty.
- untracked: A file that is not tracked (never committed or staged) and not ignored.

And that's it about Git origins and terminology.
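
These states show up directly in `git status`. Here is a minimal sketch, assuming `git` is installed; the file names and the "Example" identity are invented for the example. In the short `--porcelain` format, the first column describes the index (staging area) and the second the working tree.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

echo "base" > tracked.txt
git add tracked.txt
git commit -q -m "base"

echo "change" >> tracked.txt     # tracked, modified but not staged ("dirty")
echo "new" > staged.txt
git add staged.txt               # staged for the next commit
echo "scratch" > untracked.txt   # never added, not ignored

# "A " = added to the index, " M" = modified in the working tree, "??" = untracked.
status_out=$(git status --porcelain)
printf '%s\n' "$status_out"
```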

# Next steps

Basics : config, init, add, rm, .gitignore, commit, log, blame, diff, tag, describe, show and stash
Undos : checkout, reset, revert and restore
Branching : Git Branching
Internals : Git Internals
Collaboration : Git remote repository
Git Everyday : Git flowchart, shortcuts and references