GIT AND GITHUB: TOOLING YOUR SOFTWARE WORKFLOW AND DEVELOPMENT LIFE CYCLE

History of Software Engineering

Since the 1940s, when software engineering started close to the hardware level, the demands on the development process have increased tremendously with the spread of technologies such as embedded systems, web and mobile apps. As a result, by the 1980s the cost of owning and maintaining software was twice the cost of developing it.

The internet allows a greater sharing of interests and social encounters.

The rise of the internet in the 1990s worsened the situation. The internet allowed corporations to cooperate and exchange information more efficiently and more internationally. As a consequence, software solutions grew in complexity and in the number of requirements.

During that time costs increased by 30%, and three quarters of large software projects failed or did not meet customer requirements.

Commercial and Free Software Development Movements

While most software projects in the 1970s were commercially driven, Richard Stallman left MIT to develop programming tools and provide them for free to other developers. Around the same time Bill Joy and Chuck Haley developed new Unix tools. They delivered them with the source code to other hackers, allowing anyone to review, fix or enhance the code. As a result, every release delivered many ideas, fixes and changes.

Unlike companies, which embed developers into their organization, open source communities are less formal. Developers with the same interest or working on the same problem build a community as a loose network for sharing their knowledge and work. They can also discontinue their contribution or implement special flavors based on the work of others, with the possibility of building a new fork with other developers.

Central and Decentralized Organization Workflows

Commercial (proprietary) and non-commercial (open source) software development differ greatly in workflow and motivation. On the other hand, both cope with the same tasks and problems in their daily operations.

Nowadays companies prefer Subversion as their Source Control Management (SCM) tool for setting up a central repository and workflow, while GitHub is the most popular cloud-hosted, decentralized repository service used by open source communities.

Organization Workflows and Their Repositories

In a team each member checks in their work results and source code to a shared repository. Ideally the repository reflects the structure and processes of the organization.

Central and Distributed Workflows

Nowadays the majority of software development teams use a centralized workflow like this:
A Centralized Workflow

In a centralized workflow the repository acts as a shared hub. Each developer retrieves the sources from there as a local copy on their file system and commits their local work back to the central repository.

Changes are managed as revision numbers in the repository. A commit is rejected if another developer has changed the same files in the meantime. The developer is then asked to merge their work with the latest changes before being able to commit to the repository.
This requires a team to coordinate their work to minimize this extra effort (separate work packages, plan dependencies and responsibilities by order and time, etc.).
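
The rejected commit and the required merge can be reproduced on a small scale. The following is a minimal sketch that uses Git in place of a revision-numbered tool such as Subversion, so the commands differ from what a Subversion team would type. The directory names, the users Alice and Bob and their email addresses are placeholders invented for this illustration.

```shell
set -e
work=$(mktemp -d)   # scratch directory; all names below are placeholders
cd "$work"
git init -q --bare central.git   # the shared hub

# Alice creates the first revision and commits it to the hub.
git clone -q central.git alice
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "Base revision"
branch=$(git symbolic-ref --short HEAD)   # default branch name of this Git setup
git push -q origin HEAD
cd "$work"

# Bob takes a local copy and commits his own work locally.
git clone -q central.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "bob" > bob.txt
git add bob.txt
git commit -q -m "Bob's work"

# Meanwhile Alice changes another file and reaches the hub first.
cd "$work/alice"
echo "alice" > alice.txt
git add alice.txt
git commit -q -m "Alice's work"
git push -q origin HEAD

# Bob's copy is now out of date, so the hub rejects his push.
cd "$work/bob"
if git push -q origin HEAD 2>/dev/null; then
    rejected=no
else
    rejected=yes
fi
echo "Bob's first push rejected: $rejected"

# Bob merges the latest changes and can then commit to the hub.
git pull -q --no-rebase --no-edit origin "$branch"
git push -q origin HEAD
```

Because Alice and Bob touched different files, the merge succeeds without manual work. Had they edited the same lines, Bob would have had to resolve a conflict first, which is the extra effort that coordination tries to avoid.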

Integration-Manager Workflow

In open source development, where developers often work in their spare time and are separated from each other by location and time zone, independent work and flexibility are key factors in the workflow. Here an integration-manager workflow is commonly used:

In this workflow a central, blessed repository is owned by an integration manager, and developers have only read access to it. Each developer manages their own repository as a copy of the blessed repository and grants read access to others.

To a certain extent a developer is free to decide when to make their revisions public (that is, ‘to push’ them).

The integration manager is then informed, pulls from the changed developer repositories and decides which changes will be added, removed or merged into the blessed repository as an official release. In return, developers can pull the release, including the work of all others, back into their own repositories.
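
The push-and-pull steps of this workflow can be tried out with local directories standing in for the hosted repositories. This is only a sketch under assumptions: the repository names (blessed.git, developer-public.git), the user names and the email addresses are invented for the example, and a real project would use remote URLs and a review step before the manager pushes.

```shell
set -e
work=$(mktemp -d)   # scratch directory; all names below are placeholders
cd "$work"

# The integration manager owns the blessed repository and seeds it.
git init -q --bare blessed.git
git clone -q blessed.git manager
cd manager
git config user.name "Integration Manager"
git config user.email "manager@example.com"
echo "release 1" > README
git add README
git commit -q -m "Initial release"
branch=$(git symbolic-ref --short HEAD)   # default branch name of this Git setup
git push -q origin HEAD
cd "$work"

# A developer clones the blessed repository (read access only) ...
git clone -q blessed.git developer
cd developer
git config user.name "Developer"
git config user.email "developer@example.com"
echo "new feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"
# ... and publishes the work to a public repository of their own.
git init -q --bare "$work/developer-public.git"
git push -q "$work/developer-public.git" HEAD
cd "$work"

# The integration manager pulls the developer's changes and,
# after review, pushes them to the blessed repository.
cd manager
git pull -q --ff-only "$work/developer-public.git" "$branch"
git push -q origin HEAD
cd "$work"
```

Note that the developer never writes to blessed.git. Only the integration manager's push changes the official history.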

Dictator and Lieutenants Workflow

In large open source projects with several hundred collaborators, such as Linux, an extended form of the integration-manager workflow is in place:

Here each part of the software is the responsibility of an integration manager who acts as a lieutenant for a team of developers. Each lieutenant passes their part on to the top integration manager, who acts as a dictator controlling the overall software development.

Development in the Small and Large

In software design we distinguish between modules as design in the large and objects as design in the small. In software development we likewise distinguish workflows as work in the large and development tasks as work in the small. A repository must support both.

During daily development a collaborator does these tasks within their local repository:

Make changes (commit work results as revisions)
Compare previous changes with the current source (revision history)
Revert to a previous state and undo changes (revision rollback)
Switch between different releases or repositories, for example for a development and a production system (branched repositories)
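
With Git, these four tasks map onto a handful of commands. The session below is a minimal sketch in a throwaway repository. The file name notes.txt, the branch name production and the user identity are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the example
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"

# Make changes: commit two revisions of the same file.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second draft" > notes.txt
git commit -q -a -m "Rework notes"

# Compare previous changes with the current source.
git log --oneline        # revision history
git diff HEAD~1 HEAD     # difference between the last two revisions

# Revert: create a new commit that undoes the latest one.
git revert --no-edit HEAD
cat notes.txt            # prints: first draft

# Switch to a different line of development.
git checkout -q -b production
```

The revert does not delete history. It adds a third commit, so the undone change remains visible in the log.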

In summary, a repository plays a key role both as a system for controlling the workflow in the overall development process and as a tool in the daily work of each collaborator.

Local, Central and Distributed Repositories

All these workflows are supported by different types of repositories:

a central repository for managing official software releases and deliveries,
a distributed repository for managing different software parts, local teams or individuals and
a local repository (or file system) for managing each collaborator’s work.

Tooling the development process

A central repository – nomen est omen – is static and provides less flexibility on both levels:

In the workflow it is a single point of access and failure, and
work in the small is done only on a file system and synchronized with the central repository.

Git and GitHub

Git is an open source SCM tool originally developed by Linus Torvalds. Its features and supported workflows are based on best practices from the open source community. Though Git is mainly used for distributed repositories, it can also be set up as a central repository. In short, Git offers the same features as central SCM tools but without their constraints, supporting:

A workflow with a public and less central repository (push-and-pull principle), along with a
local Git repository for individual flexibility and independence (no single point of failure, no internet access required).

Why does the Linux community not use a central SCM tool like Subversion? As Linus Torvalds says:

“The slogan of Subversion for a while was ‘CVS done right’. …
There is no way to do CVS right.”, Linus Torvalds

He criticizes the lack of key features required for open source development:

Decentralized workflow: Developers must be able to work independently of each other and on different branches. The tool must work locally without relying on a central repository. It must be fast and efficient at merging changes or different branches.
Data safety and integrity: The data must be safe and reproducible, even after years of changes. It must be protected against corruption and manipulation. All tasks, including data integrity checks, must be verifiable locally.
Performance: Open source developers spend their own valuable time and prefer fast and efficient tools. The repository must be fast, and remote connections to other systems must be kept to a minimum.

Git basics

For creating a Git repository the following command is required:
git init

The command must be executed in the root folder of the local project and creates a sub directory ‘.git’. This is the Git repository and contains all necessary repository files.

Files created and added in sub directories can then be committed to the Git repository:
git add src/MyClass.java src/MyOtherClass.java
git commit -m "comment"

While working a developer can add and commit files at any time.

Git internals

With every commit, data is stored in the Git repository in a specific structure:

Commit object: This is the pointer object with a reference to the root tree object (the top-level folder of the project). It also stores the author, the commit message and references to its parent commit objects, that is, the commits made directly before it.
Tree objects: A tree object lists the names of the committed files and references their blob objects. There is at least one root tree object. If files are located in different folders, a sub tree object is created for each sub folder and referenced from its parent tree object.
Blob objects: The contents of the committed files, stored compressed in an efficient blob format.

The commit, tree and blob objects represent a snapshot of the committed state. A snapshot can be seen as a copy of the local files and directories in which only changed files are stored as new objects. Unchanged files are referenced through their existing blob objects.
It is important to understand that a blob is created for the file content and not for the file itself. File names and folders are kept in the tree objects. Subversion, by contrast, models a new branch as a copy of the whole directory tree.
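
The three object types can be inspected directly with git cat-file. The following sketch commits a single file and then walks from the commit object down to the blob. The file content and the user identity are placeholders chosen for the example.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the example
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
mkdir src
echo "class MyClass {}" > src/MyClass.java
git add src/MyClass.java
git commit -q -m "comment"

commit=$(git rev-parse HEAD)                    # commit object
root_tree=$(git rev-parse "HEAD^{tree}")        # root tree object
src_tree=$(git rev-parse "HEAD:src")            # sub tree for the src folder
blob=$(git rev-parse "HEAD:src/MyClass.java")   # blob with the file content

git cat-file -t "$commit"      # prints: commit
git cat-file -p "$commit"      # tree reference, author, committer, message
git cat-file -p "$root_tree"   # one entry: the src sub tree
git cat-file -p "$src_tree"    # one entry: the MyClass.java blob
git cat-file -p "$blob"        # prints: class MyClass {}
```

The file name MyClass.java appears only in the output for the src tree. The blob itself holds nothing but the content.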

Since Git tracks content rather than files, it is even possible to make granular commits, such as committing only selected changed lines of a source file (for example by staging individual hunks with git add -p).

Based on this content a unique SHA-1 hash key is generated. If a folder is only renamed, or a file is moved to another folder, and the change is committed, no new blob objects are created. The new tree object references the existing blob objects by their hash keys.
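
That a moved file keeps its blob can be checked by comparing the hash key before and after the move. This is a small sketch. The folder name core and the user identity are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the example
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
mkdir src
echo "class MyClass {}" > src/MyClass.java
git add src
git commit -q -m "Add class"
before=$(git rev-parse "HEAD:src/MyClass.java")

# Move the file to another folder and commit the move.
mkdir core
git mv src/MyClass.java core/MyClass.java
git commit -q -m "Move class to core"
after=$(git rev-parse "HEAD:core/MyClass.java")

echo "$before"
echo "$after"    # same hash key: the existing blob is reused
```

Only new tree objects and a new commit object are written for the second commit, since the content of the file did not change.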

This keeps the Git repository compact and makes operations fast. Switching from one branch to another, for example, performs very well.

The commit hash key represents the revision version. For human readability it is possible to tag a commit:
git tag -a v2.0RC1 -m "2.0 Release Candidate 1"

In Git the latest commit of a branch is marked by a branch pointer, and the default branch is named master. With every commit the master pointer is automatically moved to the new commit object:

Branch pointing into the commit data's history

Branches are used for defining different development streams like separation of production, maintenance and development releases.

Besides the master branch additional branches can be created on any commit object:
git branch testing

Internally a new branch pointer is created that references the current commit object:

Multiple branches pointing into the commit's data history

In addition there is a HEAD pointer referencing the branch the developer is currently working on:

HEAD moves to another branch on a checkout

Commits can be made on any branch. For example, one developer can continue working on master as the latest development branch while another developer works on a production or testing release.
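
The behavior of the branch pointers and of HEAD can be observed with git rev-parse and git symbolic-ref. The sketch below reuses the branch name testing and the tag from the text. The file app.txt and the user identity are placeholders, and the name of the initial branch is read from the repository because it depends on the local Git configuration.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the example
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "First commit"
main_branch=$(git symbolic-ref --short HEAD)   # "master" in the text; may differ per setup

git branch testing                      # new pointer to the current commit
git rev-parse "$main_branch" testing    # prints the same hash key twice

git checkout -q testing                 # HEAD now refers to the testing branch
git symbolic-ref HEAD                   # prints: refs/heads/testing

echo "v2" > app.txt
git commit -q -a -m "Work on testing"
git rev-parse "$main_branch" testing    # the hash keys now differ: only testing moved

git tag -a v2.0RC1 -m "2.0 Release Candidate 1"
git rev-parse "v2.0RC1^{commit}"        # the commit the tag refers to
```

Creating the branch writes no new commit, tree or blob objects. Only the commit made after the checkout adds data to the repository.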

GitHub

GitHub, Inc. was founded in 2007 and has provided a hosting solution as Software as a Service since 2008. It is the most popular open source hosting site and has surpassed SourceForge and Google Code in popularity. Groups with a GitHub repository include Ruby, Erlang, Eclipse, JBoss, Spring Framework, Twitter, Microsoft Windows Azure and Mozilla.

Social Coding

GitHub is a social coding platform for sharing code with friends, co-workers, classmates and complete strangers. Hosting sites like SourceForge focus on a project’s repository. GitHub emphasizes the users and their repositories.

On GitHub users are encouraged to promote their projects or to build on other projects by forking a repository. Forking a repository basically copies it into a new repository owned by the forking user. Later the owner of the fork can send a pull request to the owner of the original repository, asking to merge the work results back into the original repository.

Like on a social network, it is possible on GitHub to watch and follow other users. Another feature supporting workflow and communication between users is the integrated issue tracker. Each issue can be assigned to a user and supports communication via email. It is also possible to comment on commits in any branch and to refer other users to them. These comments also support email, allowing users to pinpoint a problem or issue in a commit and communicate with each other directly.

Tai Truong, Technology Evangelist