Reproducibility: Data Management, git, RStudio

I am pleased to provide this early announcement, and a bit of philosophical musing, about a workshop coming this Fall. In this workshop we will demonstrate, and provide hands-on practice with, a practical, best-practices approach to reproducible data and research-project hygiene.

Stay tuned to our Workshop Calendar for dates and times, or get notified through our official announcement channels like dvs-announce and [@duke_data](http://twitter.com/duke_data) …

To best describe the goal of this new workshop, let me back up a step. One of the newest developments in my department (DVS) is the addition of two full-time experts and a CLIR fellow, all focused on Research Data Management. We are lucky to have these three experts help us face the reproducibility crisis [1] more intelligently. With their help, scholars can pursue a collective goal: documenting the research cycle as a project life-cycle. This does mean change, as we crawl through a transition away from common past practice and toward a more datafied future with new routines and modernized best practices.

As our data-processing techniques continue their steady evolution, we want to maintain momentum. To do this we accept reproducibility as a common good. Without this shared approach, I will admit to occasionally worrying about a digital dark-ages phenomenon that threatens early-phase digital objects and renders recent-past digital information obscure or invisible. That lapse in information transparency seems plausible as society undergoes qualitative leaps in the pace of adopting newer “publication” technologies and infrastructures. Fortunately, reproducible practices can counteract that concern.

Of course, much of the focus on the crisis has to do with documenting past processes prior to a fixed publication point. To do this we must all learn to exploit newer methods as we move away from analog fixity and toward an *Everything Is Miscellaneous* [2] data-digital world, from digital ephemera to managed datafication. Adopting these new tools and methods efficiently means being exposed to, and leveraging, existing tools for code and data repositories (e.g., git, GitLab, GitHub) as well as integrating work processes into a scholarly commons that efficiently documents the entire research cycle (e.g., OSF [3]).

Okay, but in practice I sometimes find the most sophisticated repository tools confusing, even off-putting. The confusion leads to capitulation through exhaustion. I may, for example, fall back on less sophisticated file-replication tools (e.g., Box, Dropbox, OneDrive) as if they were repositories. And while these file-sharing tools are nice, and clearly better than nothing, the truth is you can easily reach a higher standard. That is what this future workshop will promote: practical repository tools that you can integrate into your evolving practices.
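
To make the contrast concrete, here is a minimal sketch of what a repository adds over plain file syncing, using the gert package in R (one git client among several; the folder name, script, and commit message are invented for illustration): every change becomes a documented, recoverable snapshot you can query later.

```r
# A minimal sketch of the repository difference: each change is recorded as a
# named, documented snapshot. Assumes the 'gert' package is installed and git
# has your name/email configured.
library(gert)

repo <- git_init("my-analysis")               # turn a plain folder into a repository
writeLines("x <- read.csv('data.csv')",       # a hypothetical first script
           file.path(repo, "analysis.R"))
git_add("analysis.R", repo = repo)            # stage the change
git_commit("Add first-pass analysis script",  # record it with a message
           repo = repo)
git_log(repo = repo)                          # the documented history a sync folder lacks
```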

Our specific focus will be the use of git through RStudio, as well as how OSF can make collaborative research processes more efficient. Our aim is to quickly demystify these processes and skip past the too-easily promoted (but too-often numbing) tomes of documentation, which demi-gods have effectively used as punishment for our warped sense of inconvenience.
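
As a small preview (a hedged sketch, not the workshop's exact curriculum), one common way to wire an RStudio project into git and GitHub is the usethis package; the project path here is hypothetical, and use_github() requires a GitHub personal access token.

```r
# One common route from a fresh RStudio project to a linked GitHub repository.
# Assumes the 'usethis' package is installed and a GitHub token is configured.
library(usethis)

create_project("~/reproducible-demo")  # scaffold a new RStudio project
use_git()                              # put the project under local git version control
use_github()                           # create a matching GitHub repository and push
```

RStudio also exposes git through its Project menu and Git pane, for those who prefer point-and-click.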

Resources

Some of the resources that will guide the future curriculum include:

  1. Richard Price, Wired, Jan. 2017. http://www.wired.co.uk/article/science-academic-papers-review

  2. David Weinberger, *Everything Is Miscellaneous*, 2007.

  3. Open Science Framework. https://osf.io/