Social coding¶

Learning outcomes of 'Social coding'

Learners

have an overview of the motivations, benefits, and risks of sharing and reusing code.
know how to start a Git repo from an existing code project.

Instructor notes

Prerequisites are:

Git

Lesson Plan:

Total 45 min
Social coding 25 min
- Start by not showing screen
- Ask questions
Briefly on licenses
Repo initialization
- Principles 5 min
- Exercises 15 min

Note

This material is based on the Social Coding lecture by CodeRefinery:
Social coding by CodeRefinery is licensed under CC BY 4.0.
The Open Science movement encourages researchers to share research output beyond the contents of a published academic article (and possibly supplementary information).
An open-source license is a type of license for computer software and other products that allows the source code, blueprint or design to be used, modified and/or shared under defined terms and conditions.

FAIR

The current buzzword for data management

You may be asked about it in, for example, making data management plans for grants:
Findable
- Will anyone else know that your data exists?
- Solutions: put the data, or at least a description of it, in a standard repository. Get a digital object identifier (DOI).
Accessible
- Once someone knows that the data exists, can they get it?
- Usually solved by being in a repository, but for non-open data, may require more procedures.
Interoperable
- Is your data in a format that can be used by others, like CSV instead of PDF?
- Or something even better than CSV. Example: 5-star open data
Reusable
- Is there a license allowing others to re-use?

Opening discussions¶

Info

Choose one or several!

1: Why would I want to share my scripts/code/data?

A: Easier to find and reproduce (scientific reproducibility)
B: More trustworthy: others can verify correctness and find and report bugs
C: Enables others to build on top of your code (derivative work, provided the license allows it)
D: Others can submit features/improvements
E: Others can help fix bugs
F: Many tools and services are free for open-source projects, so sharing has no financial cost (GitHub, GitLab, AppVeyor, Read the Docs)
G: Good for your CV: you can show what you have built
H: Discourages competitors. If others can't build on your work, they will make competing work
I: When publicly shared, usually we timestamp or set a version, so it is easier to refer to a specific version
J: You can reuse your own code later after change of job or affiliation
K: It encourages me to code properly from the start

2: What concerns me most if I share my software now?

A: It will be scooped (stolen) by someone else
B: It will expose my "ugly code"
C: Others may find bugs and mistakes. What if the algorithm is wrong?
D: I will get too many questions, I do not have time for that
E: Losing control over the direction of the project
F: Low quality copies will appear
G: I won't be able to sell this later. Someone else will make money from it
H: It is too early, I am just prototyping, I will write a version to distribute later
I: Worried about licensing and legal matters, as they are very complicated

Citation as one form of academic credit to motivate sharing papers.

Sharing papers and academic credit:

The goal is maximum visibility and maximum reuse.
The more interesting science is done referencing my paper, the better for me.
Nobody actively tries to limit the reach of their papers.

Different ways we can benefit from sharing code.

Sharing code:

"I did all the ground work and they get to do the interesting science?"
Sharing code and encouraging derivative work may boost your academic impact.
But will your work be visible if it is used two levels deep down?

From Science editorial policy

"We require that all computer code used for modeling and/or data analysis that is not commercially available be deposited in a publicly accessible repository upon publication. In rare exceptional cases where security concerns or competing commercial interests pose a conflict, code-sharing arrangements that still facilitate reproduction of the work should be discussed with your Editor no later than the revision stage."

From Nature editorial policy

"An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. A condition of publication in a Nature Research journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript."

However, a study showed that despite these policies, many people still do not share their code 😞.

Motivation for open source software

Enable derivative work
Do not lock yourself out of your own code
Attract developers who want to be able to show the coding work on their CVs
Tightly regulated domains require open source
Open-source software (OSS) can lead to more engagement from industry which may lead to more impact
If it's not open, it is not likely to become standard

Sharing software is also scary. Why? (And solutions)

Fear of being scooped

A license can help prevent this, and you can release when you are ready. In any case, it is very unlikely that others will understand your code and publish before you without involving you in a collaboration. Sharing is a form of publishing.

Exposes possibly "ugly code"

In practice almost nobody will judge the quality of your code. "Software, once written, is never really finished" (N. Asparouhova).

Others may find bugs and mistakes

Isn't this good? Would you not prefer to use code that other people have had the chance to check for bugs? If you don't release, people will assume there are bugs anyway.

Others may require support and ask too many questions

This can become a problem: use tools and the community, and protect your time. You aren't required to support anyone. You can also "archive" a repository to disable most forms of interaction (issues, PRs). A note in the README about the level of support you offer also helps.

Fear of losing control over the direction of the project

Open source does not mean everybody can change your version.

"Bad" derivative projects may appear

It will be clear which is the official version.

Code reusability¶

Should you reuse things that others have done?

Types of things that can be reused:

Main libraries (e.g. NumPy, SciPy)
Special scientific libraries
Random code from a website
Copying from Stack Overflow

Do you want others to reuse what you make?

Whether and what we can share depends on how we obtained the components.

Our work depends on outputs from others. Research of others depends on our outputs.
Whether you can share your output depends on how you obtained your input.
A repository that is private today might become public one day.
Sometimes "OTHERS" are you yourself in the future in a different group/job.
Software licenses matter. This is what we will discuss on the last day.

https://coderefinery.github.io/social-coding/sharing-data/

The Turing Way

The Turing Way is an open science, open collaboration, and community-driven project.
We involve and support a diverse community of contributors to make data science accessible, comprehensible and effective for everyone.
Our goal is to provide all the information that researchers, data scientists, software engineers, policymakers, and other practitioners in academia, industry, government and the public sector need to ensure that the projects they work on are easy to reproduce and reuse.
The Turing Way Handbook

Licenses¶

Copyright: Protects creative expression:
- software,
- writing,
- graphics,
- photos,
- certain data sets,
- this presentation.
Copyright lasts practically “forever” (lifetime of the author + 70 years in many jurisdictions).
Derivative work: Sampling/remixing

Figure: taxonomy of license models

European Commission, Directorate-General for Informatics, Schmitz, P., European Union Public Licence (EUPL): guidelines July 2021, Publications Office, 2021, https://data.europa.eu/doi/10.2799/77160

Comments on the taxonomy:

Arrows represent compatibility (A -> B: B can reuse A)
Proprietary/custom: Derivative work typically not possible (no arrow goes from proprietary to open)
Permissive: Derivative work does not have to be shared
Copyleft/reciprocal: Derivative work must be made available under the same license terms
NC (non-commercial) and ND (no derivatives) exist for data licenses but not really for software licenses

When to add a license?

Early (it is more complicated to change it later, once the code is public)
Work as if the code were public from the beginning!

How to choose a license?

Your code is derivative work if you started from existing code and made changes to it, or if you incorporated existing code into your own.
- If your code is derivative work, then you need to check the license of the original code.
From "scratch"
- Does your work contract, grant, or collaboration agreement dictate a specific license?
- Is there an intent to commercialize the code?
- When there is unknown or mixed ownership: If there are multiple persons or organizations as owners of the code, all must agree to the license.

Want to learn more?

Software and licensing lesson by CodeRefinery

Our project¶

We use GPL-3 in the project

Strong copyleft, share-alike (GPL, AGPL):
- Derivative work is free software, and derivative work extends to the combined project.
- If the licenses of components are strong copyleft, one must use the same license.

We can click on the license and an image will also show up!
- LICENSE
What does that look like?

Start a Git/GitHub repo from personal existing project¶

Many projects/scripts start as something for personal use, but expand to be distributed.
Let's start at that end and be prepared!

Principle¶

Initialize a Git project
- Browse to the right root directory (the folder containing all the project-related files)
Stage and commit
Upload to GitHub

(Optional) Exercise 2: 10-15 minutes¶

Let's say you have some code you have started to work with

Tip

Work individually locally (in VS Code or terminal)
Help each other if you get stuck
Start with 1A OR 1B
- 1A goes to Breakout room 1
- 1B goes to Breakout room 2

Exercise 1A: Identify existing project

Just use an existing programming project you already have
Browse to the right root directory (the folder containing all the project-related files)

Exercise 1B: Make a code base for a new test project

Make a test_project directory in a good place (like a local Programming formalisms course folder)

In VS Code?

Make a new window
Open Folder
Create new Folder with name test_project
Select folder
Create and save a file hello.py with the following code, including the in-code comment that answers the question "why".

# We just want some output from a simple program
print('Hello world!')

Exercise 2: Initiate the project

VS CODE

RECOMMENDED: Publish to GitHub directly and you are done!
- You may change the name of the repo for the GitHub instance, but this is not recommended.
- Include the file(s) (in this case the hello.py file) in the repo!
- Double check it was created on GitHub!
  - It should show up under repos in your user space
ALTERNATIVE: Initialize and then continue with Exercise 3.

Terminal

Open a terminal and go to the project folder, which will become the project repository (repo)
Run git init
Make sure that a .git directory has been created
- It is a hidden directory, so you have to show hidden files; in a bash terminal, use ls -a
Now you have a Git repo called test_project
Check with the command: git status
- It is always a safe command to run, and in general a good idea when you are trying to figure out what to do next.
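The terminal steps above can be sketched as follows. This is a minimal sketch, assuming Git is installed; a throwaway directory created with mktemp stands in for your own project folder, and the variable name project_dir is only illustrative.

```shell
# Assumption: a temporary directory stands in for your real project folder.
project_dir="$(mktemp -d)/test_project"
mkdir -p "$project_dir"
cd "$project_dir"

# The example file from Exercise 1B.
cat > hello.py <<'EOF'
# We just want some output from a simple program
print('Hello world!')
EOF

git init      # turn the folder into a Git repository
ls -a         # the hidden .git directory should now be listed
git status    # hello.py is reported as untracked: nothing has been added yet
```

For an existing project, skip the first block and run the last three commands from the project's root directory.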

(If needed) Exercise 3: Add and commit the content

So far, there is no content. We have to manually add the content to the repo.
Add and Commit your changes

VS Code

We do this all the time! :)

Terminal

git add .
git commit -m 'first commit'
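A minimal end-to-end sketch of staging and committing, assuming Git is installed. The temporary directory and the name and e-mail values are placeholders for illustration; the identity is stored only in this repository so that the commit also works on a machine without a global Git configuration.

```shell
# Assumption: a temporary directory stands in for your real project folder.
project_dir="$(mktemp -d)/test_project"
mkdir -p "$project_dir"
cd "$project_dir"
cat > hello.py <<'EOF'
# We just want some output from a simple program
print('Hello world!')
EOF
git init

# Placeholder identity, stored only in this repository's configuration.
git config user.name "Your Name"
git config user.email "you@example.com"

git add hello.py                # stage one file ("git add ." stages everything in the folder)
git status                      # hello.py is now listed under "Changes to be committed"
git commit -m 'first commit'    # record the staged snapshot
git log --oneline               # one line per commit; here only "first commit"
```

Running git status before and after each step is a cheap way to see which files are untracked, staged, or committed.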

(If needed) Exercise 4: Upload to GitHub

In VS Code

There is an opportunity to directly publish on GitHub

From GitHub

Make sure that you are logged into GitHub.
You can use this for both VS Code and terminal

New repo

To create a repository, either click the green "New" button (top right corner)
or, if you are on your profile page, use the "+" menu (top right corner).

New top-right

On this page choose a project name, e.g. test_project or a project name suiting your existing project.
NOTE: It is not necessary to use the same name, but it makes it easier to know what is what when syncing between GitHub and Git.
For the sake of this exercise, do NOT select "Initialize this repository with a README"
and do NOT add a license

New repo

Example project

Press "Create repository"

Create and push

Choose HTTPS
Copy-paste the code for "…or push an existing repository from the command line"
Go to local git terminal and go to the git project you started above
Paste the code
Did it work??
Reload the GitHub page and check that the files present locally are also present there.
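The commands GitHub displays for pushing an existing repository typically do three things: add a remote called origin, name the branch main, and push. The sketch below runs those three steps against a local bare repository that stands in for GitHub, so it can be tried without an account. With a real repository you would use the URL shown on the GitHub page in place of "$remote_dir". All directory names and the identity values are illustrative assumptions.

```shell
# Assumption: a local bare repository stands in for the empty GitHub repository.
work_dir="$(mktemp -d)"
remote_dir="$work_dir/remote.git"
project_dir="$work_dir/test_project"
git init --bare "$remote_dir"

# Recreate the small local repository with one commit.
mkdir -p "$project_dir"
cd "$project_dir"
cat > hello.py <<'EOF'
# We just want some output from a simple program
print('Hello world!')
EOF
git init
git config user.name "Your Name"         # placeholder identity
git config user.email "you@example.com"  # placeholder identity
git add hello.py
git commit -m 'first commit'

# The three steps: connect the remote, name the branch main, push.
git remote add origin "$remote_dir"
git branch -M main
git push -u origin main

# List the branches that now exist on the remote.
git ls-remote --heads origin
```

The -u option records origin/main as the upstream branch, so later a plain git push or git pull is enough.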

Done!

What we did¶

Workflow

graph TB

P["Project idea"] -->|git init| Node2
P["Project idea"] --> hello.py -->|git add| Node4
Node4 --> |git commit| Node1
Node2 --> |git push| Node5

%% C[Uncommitted changed hello.py] -->|commit button| R
R <--> Node5
       subgraph "Local Git"
        Node2[project]
        Node1[hello.py]
        Node1 <--> Node2

        end

        subgraph "staging area"
        Node4[hello.py]
        end

        subgraph "GitHub"
        Node5[project]
        R[hello.py]
        end

About releases

About releases on GitHub