Version Control Systems: in Practice, now, Git

Organisation: Copyright (C) 2021-2022 Olivier Boudeville
Contact: about (dash) howtos (at) esperide (dot) com
Creation date: Saturday, November 20, 2021
Last updated: Thursday, December 1, 2022

Overview

No real software development should happen without some sort of VCS (standing for Version Control System), notably in order to track the versions of the source files involved and to ease collaborative work on them.

Many solutions have been defined for this purpose (CVS, Clearcase, SVN, Mercurial, etc.), but now a single tool is the de facto standard: Git, which is a distributed version control system available as free software; refer to its website for more details.

Git Usage

Beyond the documentation relative to its general use, projects have to adopt their own set of conventions - regarding the management of branches, commits, tags, etc. - based on their preferences and context.

Basic Operations

  • managing branches:
    • creating branches is done with the checkout command, abbreviated here as co (a Git alias, which may be defined with git config --global alias.co checkout)
    • to create a branch deriving from the current one (the current HEAD) and switch to it at the same time (performing its co, while inheriting any local changes): git co -b my_new_branch
    • to create a local branch corresponding to a remote one (let's suppose it is named some_branch), assuming that a remote server (ex: my_remote, possibly origin) has already been declared (ex: git remote add my_remote URL):
      • first step is to update the remote-tracking branches with git fetch my_remote, then to create the target local branch tracking that remote (upstream) one with: git co -b some_branch my_remote/some_branch (also switching to it here)
      • a shortcut is to use git co --track my_remote/some_branch instead
      • even shorter: if no local branch of that name exists yet and it exactly matches the name of a branch on only one remote, git co some_branch will suffice
    • to delete a local branch: git branch -d my_branch
  • managing tags:
    • to list (local) tags: git tag
    • to have information about an already-existing tag: git show my_tag
    • to set a new annotated tag: git tag -a foobar-version-2.4.0 -m "Release of the version 2.4.0 of Foobar."; prefer naming tags differently from branches (ex: foobar-version-2.4.0 rather than foobar-2.4.0) to spare ambiguities to Git
    • once set, a tag must be explicitly pushed to a remote, for example: git push origin my_tag; all tags can be pushed with git push --tags (the remote can be implied)
    • to delete a tag that was not pushed: git tag --delete my_tag
  • determining whether a file is in VCS, knowing that due to .gitignore rules, update-index --skip-worktree, etc. it is not always obvious:
# Target file is tracked iff it is listed by:
$ git ls-files | grep my_file

# Or, in order to trigger an error if this target file is not tracked:
$ git ls-files --error-unmatch my_file
  • getting the version of a file as it was at a given commit:
# Replaces the current version of that file by the designated one:
$ git checkout COMMIT_ID path/to/the/target/file

# Outputs on the console the designated version:
$ git show COMMIT_ID:path/to/the/target/file

# Outputs on the console the diff between the designated
# version and the current one (whereas 'git show COMMIT_ID path'
# would report the changes made to that file by that commit):
$ git diff COMMIT_ID -- path/to/the/target/file
  • listing the files modified by a given commit: git show --name-only MY_COMMIT_ID
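
The operations listed above can be tried safely in a throwaway repository. The following is only a minimal sketch: the temporary directory, the main branch name and the file content are assumptions made for the example, and plain checkout is used instead of the co alias so that no prior configuration is needed.

```shell
set -e
# Throwaway repository, so that no real clone is affected:
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # assumed name of the initial branch
git config user.name "John Doe"
git config user.email john.doe@foobar.org

echo "first" > my_file
git add my_file
git commit -q -m "Add my_file."

# Create a branch from the current HEAD and switch to it:
git checkout -q -b my_new_branch
echo "second" >> my_file
git commit -q -a -m "Update my_file."

# Set an annotated tag, then list the local tags:
git tag -a foobar-version-2.4.0 -m "Release of the version 2.4.0 of Foobar."
tags=$(git tag)

# Files modified by the last commit ('--format=' hides the commit header):
changed=$(git show --name-only --format= HEAD)

# Content of my_file as it was one commit earlier:
old_content=$(git show HEAD~1:my_file)

# Integrate the branch in main (a fast-forward here), then delete it:
git checkout -q main
git merge -q my_new_branch
git branch -q -d my_new_branch
echo "tags: $tags; changed: $changed; old content: $old_content"
```

Note that git branch -d refuses to delete a branch that has not been merged, which is why the merge is done first in this sketch.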

Managing Branches

Creating branches makes it possible to separate threads of work (while preserving their lineage) and to progress concurrently. Yet often their content will ultimately have to converge; depending on the intent, two use cases can be considered, resulting in different Git uses.

Merge versus Rebase

Here one may want:

  • either to integrate back a development branch (ex: my-feature) in a shared, parent one (ex: master): then one shall prefer using merge, in order to keep separate histories and not affect the past one of the shared branch
  • or to resynchronise a development branch (ex: my-feature) on the last version of a shared branch and continue these developments: then one shall prefer rebase, so that the history of the development branch contains only its own changes (less noise, linear history)

In practice, in order to transfer the changes of a branch A in a branch B:

$ git co B

# Either first case (integrate development A in master B):
$ git merge A # or, if A is to be fetched from a remote: git pull my_remote A

# Or second one (resynchronise development B on master A):
$ git rebase A # or, similarly: git pull --rebase my_remote A

How is such a rebase of branch B on branch A done? The bifurcation point of B relative to A is moved from its initial position to the current head of A, on top of which all changes recorded in B are re-applied; the resulting history of B looks as if these changes had been made directly from the version of A designated in this rebase. The tip of B then comprises both the changes synchronised from A and, after them, the ones specifically introduced in B, so that A can later be directly fast-forwarded to it.

Then, to update the remote with these post-rebase commits, git push --force-with-lease shall be used [1].

[1] Rather than performing a plain push, having it fail, pulling, and ending up with duplicates of the changes. Should this happen, rewind these changes, for example with: git reset --hard <full_hash_of_commit_to_reset_to>.

More information: [1] or, in French: [2], [3], [4].
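
The two cases can be observed in a throwaway repository. This is a sketch under assumptions (temporary directory, file names and commit messages are made up for the example): my-feature is first resynchronised on master with a rebase, after which integrating it back in master is a mere fast-forward.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "John Doe"
git config user.email john.doe@foobar.org

echo "base" > base.txt && git add base.txt && git commit -q -m "Base."

# The development branch forks here and records one change:
git checkout -q -b my-feature
echo "feature" > feature.txt && git add feature.txt && git commit -q -m "Feature."

# Meanwhile the shared branch progresses as well:
git checkout -q master
echo "shared" > shared.txt && git add shared.txt && git commit -q -m "Shared."

# Second case: resynchronise my-feature on the last version of master.
git checkout -q my-feature
git rebase -q master

# The feature commit now sits directly on top of the tip of master:
parent_of_feature=$(git rev-parse my-feature~1)
tip_of_master=$(git rev-parse master)

# First case: integrate my-feature back in master; after the rebase no merge
# commit is needed (--ff-only fails rather than create one).
git checkout -q master
git merge -q --ff-only my-feature
git log --oneline --graph
```

Without the prior rebase, the same git merge (without --ff-only) would have created a merge commit having two parents.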

Directly Transferring Changes

Sometimes, one may want to directly transfer the changes of a derived branch B to a parent branch A. When one knows for sure that the versions in B shall be preferred in all cases to their counterparts in A (note that a classical merge is already fully able to manage fast-forwards), one may use:

$ git checkout A
$ git merge -X theirs B

No conflict should arise (source).

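The same is possible with rebase; for example: git rebase -X theirs B. Beware though that, during a rebase, the roles of ours and theirs are swapped compared to a merge: theirs then designates the commits being replayed (those of the current branch), and ours the branch being rebased onto.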

Note that -X is a strategy option (passed to the merge strategy in use), whereas -s selects the merge strategy itself.

When using ours rather than theirs:

  • -X ours uses "our" version of a change only when there is a conflict
  • whereas -s ours ignores the content of the other branch entirely (in all cases), and uses "our" version instead

Another way of forcing the content of a branch B to be the same as the one of a branch A is, while B is checked-out, to execute: git reset --hard A. As mentioned previously, push shall be done then with git push --force-with-lease.
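
The effect of -X theirs can be checked on a deliberately conflicting change. In this sketch (the temporary repository, the f.txt file and its content are assumptions made for the example), branches A and B modify the same line differently, and the merge settles the conflict in favour of B without any manual step.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/A
git config user.name "John Doe"
git config user.email john.doe@foobar.org

echo "original" > f.txt && git add f.txt && git commit -q -m "Base."

# B and A then change the same line differently, which would normally conflict:
git checkout -q -b B
echo "version of B" > f.txt && git commit -q -a -m "Change in B."
git checkout -q A
echo "version of A" > f.txt && git commit -q -a -m "Change in A."

# Conflicting hunks are resolved in favour of B ("theirs"):
git merge -q -X theirs -m "Merge B, preferring its changes." B
merged_content=$(cat f.txt)
echo "$merged_content"
```

Changes that do not conflict are still merged from both sides; only the conflicting hunks are taken from B.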

Common Procedures

Overcoming self-signed SSL certificate issues

To avoid, typically in a company internal setting, errors like:

Cloning into 'XXX'...
fatal: unable to access 'https://foo.bar.org/XX/XXX/': SSL certificate problem: self signed certificate in certificate chain

the http.sslVerify=false option may be used, even if it weakens the overall security.

This is typically useful initially:

$ git -c http.sslVerify=false clone https://foo.bar.org/XX/XXX

So that the next operations (ex: future pushes) also overcome this problem for the current repository, run from within the current clone:

$ git config http.sslVerify false

Setting the right metadata for the next commits

Doing so avoids having to amend commits a posteriori.

If this information applies to all projects:

$ git config --global user.name "John Doe"
$ git config --global user.email john.doe@foobar.org

Otherwise, this shall be done at least on a per-project basis with:

$ git config user.name "John Doe"
$ git config user.email john.doe@foobar.org

Also git config --global --edit may be of use (beware of launching vi by accident...).
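
To check which identity a given clone will actually use, the settings can be read back. A minimal sketch, run here in a throwaway repository (an assumption of the example) so that no real configuration is modified:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# Per-project identity, stored in the .git/config file of this clone only:
git config user.name "John Doe"
git config user.email john.doe@foobar.org

# Effective value, the local settings taking precedence over the global ones:
effective_name=$(git config --get user.name)
# Value read from the local configuration only:
local_email=$(git config --local --get user.email)

# Shows from which configuration file each value comes:
git config --show-origin --get-regexp '^user\.'
```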

Performing operations on remotes without systematic authentication

Using an SSH key pair, with its public key declared on said remote, is a relevant approach, safer than for example using a ~/.netrc file.

Updating One's Fork from its Upstream

So you forked a repository (let's say it is in https://github.com/some_project/some_repo.git) and made progress - yet in the meantime the upstream repository may also have been updated, and you want to integrate these changes in yours.

First step is to ensure that this repository (designated here as upstream for convenience) is locally known:

$ git remote add upstream https://github.com/some_project/some_repo.git

Then, from a fully-committed clone of your fork (let's suppose we are using the main branch in all repositories):

$ git fetch upstream

# More appropriate than a merge:
$ git rebase upstream/main

# Repeatedly, as long as conflicts are found (after having fixed and
# staged each of them with 'git add'):
$ git rebase --continue

# Forced, as otherwise the current branch will be deemed to be behind our remote
# (hopefully your branch at origin is not protected by a hook; otherwise:
# 'git checkout -b some_branch', etc.):

$ git push -f origin main
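
This procedure can be rehearsed entirely locally before touching a real fork. In the sketch below, two bare repositories stand in for the upstream repository and for the fork; all directory names (upstream.git, fork.git, seed, clone) are assumptions made for the example, and the forced push relies on the --force-with-lease option recommended earlier rather than on -f.

```shell
set -e
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME="John Doe" GIT_AUTHOR_EMAIL=john.doe@foobar.org
export GIT_COMMITTER_NAME="John Doe" GIT_COMMITTER_EMAIL=john.doe@foobar.org

# Local stand-ins for the upstream repository and for the fork:
git init -q --bare upstream.git
git init -q --bare fork.git

# A first commit, shared by both (this emulates the forking):
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/main
echo "v1" > README && git add README && git commit -q -m "Initial."
git push -q "$work/upstream.git" main
git push -q "$work/fork.git" main

# Upstream then progresses on its side:
echo "v2" > README && git commit -q -a -m "Upstream progress."
git push -q "$work/upstream.git" main
cd "$work"

# Our clone of the fork, with a commit of its own already pushed to the fork:
git clone -q -b main "$work/fork.git" clone && cd clone
echo "mine" > NOTES && git add NOTES && git commit -q -m "Fork progress."
git push -q origin main

# The procedure itself:
git remote add upstream "$work/upstream.git"
git fetch -q upstream
git rebase -q upstream/main
git push -q --force-with-lease origin main
git log --oneline
```

No conflict arises here since the two sides touch different files; with a real fork, the rebase may stop and require git rebase --continue.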

Creating an empty branch

Rather than creating it from a pre-existing branch and removing all inherited content, prefer:

$ git checkout --orphan my_new_branch

(typically useful for GitHub Pages branches; may then be followed by some adds and git commit --allow-empty -m "Initial website.")
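
One detail is worth illustrating: right after checkout --orphan, the files of the previous branch are still staged, so they have to be removed for the new branch to start empty. A sketch in a throwaway repository (directory and file names are assumptions made for the example):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "code" > main.c && git add main.c && git commit -q -m "Some sources."

git checkout -q --orphan my_new_branch
# The files of the previous branch are still staged at this point; unstage
# and remove them so that the new branch really starts empty:
git rm -q -rf .
git commit -q --allow-empty -m "Initial website."

# The new branch has a single commit (with no parent) and no tracked file:
commit_count=$(git rev-list --count my_new_branch)
file_count=$(git ls-files | wc -l)
echo "commits: $commit_count, tracked files: $file_count"
```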

Listing differences with prior versions of a file

In order to list the differences of a given file with the previous commits (precisely: of a set of pathspecs), one may use our dif-prev.sh script, which by default reports the differences with the last committed version. With the --all option, it lists all differences, until the first addition of this file.

Preventing the commit of a file in VCS that is often locally modified

One should use this method:

$ git update-index --skip-worktree <file-list>

The opposite operation is:

$ git update-index --no-skip-worktree <file-list>
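
The effect of this flag can be observed as follows. This is a sketch in a throwaway repository, where local.conf is a hypothetical file standing for one that is often tuned locally:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "port=80" > local.conf   # hypothetical, often locally-tuned file
git add local.conf && git commit -q -m "Add default configuration."

git update-index --skip-worktree local.conf
echo "port=8080" > local.conf

# The local modification is no longer reported:
status_when_skipped=$(git status --porcelain)

# Flagged files are listed with a leading 'S' by 'git ls-files -v':
flagged=$(git ls-files -v | grep '^S' || true)

# Once the flag is removed, the modification is reported again:
git update-index --no-skip-worktree local.conf
status_after_reset=$(git status --porcelain)
echo "skipped: '$status_when_skipped'; flagged: $flagged; after: '$status_after_reset'"
```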

Listing the files managed in VCS from the current directory

Use git ls-files to determine the files that are already managed in VCS, recursively from the current directory.

To list the untracked files (i.e. the files not in VCS), use git ls-files --others.
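
Note that git ls-files --others also lists the ignored files, unless the standard exclusion rules are requested. A sketch in a throwaway repository (all file names are assumptions made for the example):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "*.log" > .gitignore
echo "tracked" > tracked.txt
git add .gitignore tracked.txt && git commit -q -m "Initial."
echo "new" > untracked.txt
echo "noise" > build.log

# Files managed in VCS:
tracked=$(git ls-files)
# All files not in VCS, ignored ones included:
others=$(git ls-files --others)
# Same, but honouring the .gitignore rules:
others_not_ignored=$(git ls-files --others --exclude-standard)
printf '%s\n' "tracked:" "$tracked" "others:" "$others" "not ignored:" "$others_not_ignored"
```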

Reducing the size of a repository

One may use our list-largest-vcs-blobs.sh script to detect any larger files that should not be in VCS (ex: should a colleague have committed by mistake a third-party archive, or unexpected data such as CSV files).
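
For readers who do not have that script at hand, a comparable listing can be obtained with stock Git commands. The following is not the script mentioned above, only a sketch of one possible approach, run here in a throwaway repository in which big.csv stands for data committed by mistake; paths containing spaces would be truncated by this simple awk filter.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "small" > small.txt
head -c 100000 /dev/zero > big.csv    # stand-in for data committed by mistake
git add small.txt big.csv && git commit -q -m "Sources and, mistakenly, data."
git rm -q big.csv && git commit -q -m "Remove data."   # still in the history

# Lists all objects reachable from any reference, keeps only the blobs, and
# sorts them by decreasing size (in bytes):
largest=$(git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk '$1 == "blob" { print $2, $3 }' \
  | sort -n -r | head -n 5)
echo "$largest"
```

This shows that a file removed by a later commit still weighs on the repository, which is why a history-rewriting tool is needed.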

Then install BFG Repo-Cleaner:

$ mkdir -p ~/Software/bfg-repo-cleaner/
$ cd $_
$ mv ~/bfg-1.14.0.jar .
$ ln -s bfg-1.14.0.jar bfg.jar
# For example in ~/.bashrc:
$ alias bfg="java -jar ~/Software/bfg-repo-cleaner/bfg.jar"

All developers should be asked to commit and push their sources (git add, commit, push), to archive their clone (ex: in a timestamped .xz file like 20220412-archive-clone-foobar.tar.xz), and to wait until notified that they can create a new clone.

The repository may be then cleaned up (ex: from large, unnecessary CSV files) in isolation, with:

$ git clone --mirror XXX/foobar.git
$ bfg --delete-files '*.csv' foobar.git
$ cd foobar.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push

Then all developers shall be requested to perform a new clone and to check the fetched content (ex: with regard to the content of the last branch in which they committed).

Fixing LF vs CRLF End of Line Problems

Use Git Attributes to specify the proper attributes of files and paths.

One may define a .gitattributes file for example with *.js eol=lf, * text=auto, or:

# No CRLF conversion for DOS/Windows batch files.
# They should be stored with the CRLF line terminators.
#
*.bat -crlf
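
Which attributes end up applying to a given path can be verified with git check-attr. A sketch in a throwaway repository, combining the rules quoted above (app.js, run.bat and notes.txt are hypothetical file names):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
printf '%s\n' '* text=auto' '*.js eol=lf' '*.bat -crlf' > .gitattributes

# The files do not need to exist for the rules to be evaluated:
js_eol=$(git check-attr eol -- app.js)
bat_crlf=$(git check-attr crlf -- run.bat)
txt_text=$(git check-attr text -- notes.txt)
printf '%s\n' "$js_eol" "$bat_crlf" "$txt_text"
```

Each line of the output follows the path: attribute: value pattern, "unset" corresponding to an attribute disabled with a leading dash.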

Fixing a commit message

If no push was done, it is as simple as replacing the former message by a new one, like in:

$ git commit --amend -m "This is a fixed commit message."
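
The reason why this should be avoided once pushed is that amending creates a new commit, with a different hash, instead of modifying the former one. A sketch in a throwaway repository (file name and initial message are assumptions made for the example):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "content" > f.txt && git add f.txt
git commit -q -m "This is a comit mesage."
hash_before=$(git rev-parse HEAD)

git commit -q --amend -m "This is a fixed commit message."
hash_after=$(git rev-parse HEAD)
message=$(git log -1 --format=%s)

# Still a single commit in the history, yet a different one:
echo "$hash_before -> $hash_after: $message"
```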

Tools

On Most Platforms

At least on UNIX, the command-line Git client (git) is certainly the best tool. In difficult situations, graphical tools such as gitk may be of help.

See also our Ceylan-Hull section about VCS-related scripts.

On Windows

Tools like TortoiseGit may foster a view on the usage of Git that is a bit particular, conflating concepts or introducing extra ones (ex: a sync command). Apparently also at least some pulls did not reintroduce files just removed from the working directory.

More generally, cloning on a Windows host a UNIX-originating repository comprising symbolic links may induce oddities (ex: a symlink named S pointing to Foobar resulting, on a Windows clone, in a file named S whose content is, literally, the text "Foobar", instead of the expected content of the Foobar file).

Another option is to use Visual Studio Code (vscode), which natively supports Git (provided that the command-line version is already installed). One may select View -> SCM (or Ctrl-Shift-G) for that. Clicking on the "VCS" icon (three rings linked by two curves; the third from the top) displays a contextual view offering various associated operations (here based on Git).

We finally preferred using MSYS2 + Git rather than Git Bash, named "Git for Windows"; hints to speed up these tools may apply.

Inner Workings

Git stores internally every version of every file separately (not as a diff with a parent version) as a blob (an opaque binary content) identified by its (SHA1) hash.

A commit references a tree representing the filesystem of interest at a given moment (snapshot), together with some metadata (author, message, parent commits). This tree references the files through their SHA1, similarly to a Merkle tree.

A branch is thus nothing but a pointer to a given commit, and HEAD designates the current branch. Git stores natively only blobs, trees, commits and, for annotated tags, tag objects.

The reported differences in the content of a file or a tree are thus only recreated (established dynamically) by Git commands; they are not natively tracked.
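
These objects can be inspected with the low-level cat-file, hash-object and rev-parse commands. A sketch in a throwaway repository (greeting.txt and its content are assumptions made for the example):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email john.doe@foobar.org
echo "hello" > greeting.txt && git add greeting.txt && git commit -q -m "Greeting."

commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')

# Hash under which the tree references the file...
blob_in_tree=$(git rev-parse HEAD:greeting.txt)
# ...which is computed from the content of that file only:
blob_from_content=$(git hash-object greeting.txt)
blob_type=$(git cat-file -t "$blob_in_tree")

# A branch is a mere reference to a commit, and HEAD a reference to a branch:
branch_ref=$(git symbolic-ref HEAD)

git cat-file -p HEAD           # tree, author, committer and message
git cat-file -p 'HEAD^{tree}'  # entries of the tree: mode, type, hash, name
```

Two files having the same content, whatever their names and locations, are therefore stored as a single blob.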

Translations

From English to French:

Documentation

Many pointers exist, doing a great job in unveiling how Git is to be used.

In English, Pro Git is surely a reference.

In French: