1.5 Version Control & GitHub#

Version Control#

Version Control is a system that organize and tracks the version of codes.

../_images/phdcomics_finaldoc.gif

Fig. 1 Versions of a research paper. File nameing and tracked changes can become messy. From PhD Comics.#

It keeps track of the changes and who made the changes. See further discussion on The Turing Way [Community, 2022]. Here are the main points.

An illustration of a main branch

Fig. 2 An illustration of a main branch from The Turing Way 1.#

At each commit, Git will track the changes. Several people can work on the same file, and version control will recognize conflict and provides option to resolve it. It is good practice to make the main branch the cleanest branches. Create additional branches to work on specific parts of the project.

An illustration of a sub branch that gets created, committed, and merged.

Fig. 3 An illustration of multiple sub branches that get created, committed, and merged to the main branch. From The Turing Way 1.#

What is the difference between git and GitHub?

git is a software: You can use it to track changes only on your computer GitHub is a platform that can host your repository. There are other platforms like Bitbucket, GitLab.

Set up Git#

You need to set up your user name and your e-mail

    > git config --global user.name "superseismo"
    > git config --global user.email "superseismo@uw.edu"

Use the same user name as your GitHub account. On GitHub, you will see who has made the modifications to the repository.

You need to do it only once on your computer.

GitHub#

Create a GitHub account#

Create a GitHub account using the same user name and email address as in your git config. Free accounts make you have all of your repository open to the public. Pro (for a fee) accounts allow you to keep repositories private.

Use the GitHub desktop App#

GitHub has a Command Line Interface to use with the Terminal. Check out the Github CLI Cheat Sheet.

GitHub has a GUI desktop app for all OS, including MacOSX and Windows. The desktop app handles the CLI in a user friendly API (Application Programming Interface). The workflow and visualization are really friendly and it is recommended to use from the laptop environment. In JupyterHubs from remote servers, it is still necessary to use CLI.

Notes on GitHub enhanced authentication#

GitHub offers 3 ways to enhance authentication. Download the Android/iOS GitHub Mobile app. Just like Duo, GitHub will use the phone for additional layers of authentication.

  • Username + Password with two-factor authentication (2FA), which requires the GitHub Mobile App.

  • Personal Access Token: Click on the upper right account -> Settings -> Developer settings (left sidebar) -> Personal access tokens (left sidebar). Then generate new token, add a descriptive name, give it an expiration (or not), grant specific permission to the token, Generate token, copy using the clipboard icon, save into a local text file. Keep the file somewhere on local, inaccessible from other cloud storage (e.g., Dropbox or Gdrive). The token replaces the password.

  • SSH keys. Create new SSH keys. See an example on how to set up the key here.

More details here on GitHub authentication.

Use an existing Repository#

You can create 2 copies of a repository:

  • A Fork (top right click) creates a totally independent copy of the repository. You can modify code as needed and this will not change the original repository

  • A Clone (also top bar) creates a copy of the original repository, but is linked to the original repository that will continue to be synchronized. Do so with the following commands: git clone “https://github.com/superseismo/example.git

Create a new Repository#

You are going to create your first repository. Congrats! Go on the browser, In the upper-right corner of any page, use the drop-down menu, and select New repository. It will ask you a series of default settings.

We briefly go over some best practices inspired by Nenadic et al. [2022].

Good repository content are:

  • README.md: Center piece of the docmumentation. Markdown/text file that describes the purpose of the repository and contain some explanation of what this repository is about. It also contains any information that the contributor sees as important: installation guide, licensing, contribution gudelines, code of conduct, citation, badges and links to Colab or Binder notebooks, … See Software Carpentries Intermediate REsearch Software Development in Python for more details. Find a list of repositories with great readme here. Some of the line items to add:

    • Installation/Deployment

    • Basic Usage: step-by-step instructions to use the software

    • Contributing: detail the type of contributions welcome

    • Contact information / getting help: guidelines on how to get in touch

    • Credits/Acknowledgement: Specific the academic publication to cite (e.g., add a bibtex text in a code box like this):

       @software{MLGeo, 
       authors = {The GeoSMART team}, 
       year = 2022, 
       doi = {Not yet}
       }
      
    • License: a short description, or automatically generated license. See below.

  • License: Critical: without a license no one can reuse and modify the software. Find a detailed description of the various software vs data license in the Turing Way. Choosealicense is a nice guide to help you find the appropriate license.

  • Fix the virtual environment for users to be able to re-use the software. Add an environment file environment.yml or requirements.txt

To add files to your repository:

 git add mycode.py
 git commit -m "upload a slick python code"
 git status
 git push

(*) Everytime you start working on your repository, make sure you have the up-to-date version on your local environment. In CLI, this means:

 git pull

To modify mycode.py, use the command:

 git checkout mycode.py

Note

for the MLGeo2022 course, you will have 2 repositories: - create your own repository under the UW-ESS-DS GitHub organization under the name: MLGEO2023_UWNETID. - the course repository, please clone: git clone “https://github.com/UW-ESS-DS/MLGeo-2023” Then copy the environment.yml file from the MLGeo2023 course into your own repository.

  1. From the browser:

    • Connect to your GitHub account

    • On the top right of your homepage, click "+" and New repository and choose a name.

    • Check “Add a README.md

    • Choose a license

    • Create repository

    • Clone the repository to your local.

  2. From the GitHub Desktop app:

    • Click Add on the top of the left sidebar

    • Click Create new repository and choose: Name, short description, local path (avoid a home directory on other cloud (Dropbox or Github) to reduce headaches), initialize with a README, choose license, choose Git Ignore (most likely programming language)

    • Open in Visual Studio Code or other preferred editor.

    • Check that the GitHub page is up to date

    • Fetch, Push, Pull, etc

  3. From the CLI:

    • Initialize a local directory as a git repository:

        git init
        git add *
        git commit -m "my first commit"
        git push
      
    • Be aware of the need to use passwords or tokens. Your configuration may also be incomplete, so re run configuration listed above.

    • You can now start writing codes and documentation. GitHub uses staging as terminology to keep track of the new changes that could be committed to the remove server. To add the new script:

        git add newfile.py
        git commit -m "add new script"
        git push
      
    • If you made some code modification but prefer using the version that is on the GitHub, you can unstage and reset the file from the last version:

        git reset HEAD newfile.py
        git checkout newfile.py
      
    • To do the same for the entire repos:

        git reset HEAD~
        git checkout
        git status
      

Work as a team with GitHub#

The main branch should remain the clean, official version for the public.

  • Use Pull requests to allow others to suggest code changes to a repos. The pull request tracks line-by-line changes and allow the repository contributors. Here is the workflow: - Create a fork of the repository - Clone the fork to the local computer - When done, add+commit+push from local to remote repository. The fork has the new changes. - Pull request from the browser original repository. Make sure that you mention specific colleagues to analyze your submission (use @their-github-name) to notify them. - look at the diff between the two repository. - the repository contributors/owners will accept or suggest additional modification before accepting. - Once accepted, your changes will be part of the main repository.

../_images/overview-github-collaboration.png

Fig. 4 Pull requests using GitHub: Found in EarthDataScience. Source: Earth Lab, Alana Faller#

  • Use GitHub Issues to post bugs or performance issues, so that the contributors can keep track and address them. There are templates to posting issues, and online discussions about it. The main takeways are: - Avoid duplication, check if somebody else has had the same issue - Use template

../_images/GitHubIssue.png

Fig. 5 GitHub Issues: Use template#

Further discussions. here

Publish your software#

If the software will be used for future research and would be cited by the community, publish your software on Zenodo. GitHub provides guidelines here.

Note

Note for MLGeo 2022 cohort

The course repository is here. Clone it to your jupyterlab environment:

    git clone "https://github.com/UW-ESS-DS/MLGeo-Autumn22"
    cd MLGeo-Autmn22
    git status

To force-reset the repository from the main branch git reset –hard origin/main git pull

Copy the homeworks into your own repository (MLGEO2022_UWNETID). You can play with the tutorials in the MLGeo-Autumn22, but if there is a hard reset it will overwrite your notebook.

  • Use your own repository to keep your own work.

  • Push your repository to the main branch after your session so that the instructor can see the changes.

Handling .gitignore to avoid having large files pushed to remote server.


1(1,2)

[Community, 2022]