Project:
Thinklab Meta [meta]

File hosting advice


As I've discussed with @jspauld, I think a ThinkLab-hosted storage solution would be a good way to ensure that files persist. Each file could even be reachable as a subdomain of the master DOI. However, I also see a risk of overstretch, since other services may already provide good storage and embedding solutions.

I would like people's advice on good places to host different types of content. Ideally, the hosting would be free and persistent, and would allow commercial reuse and embedding.

ThinkLab definitely could host files. And it may be that we need to do so in order to allow users to embed figures and make them look pretty. That said, it's my strong intention to have ThinkLab play nice with as many third-party services as possible.

Figshare seems to be the go-to site for hosting scientific files. It is unclear whether they will allow their users to embed content on ThinkLab. In the worst case, though, you can just link people to the content on Figshare.

One possibility is to start off using Figshare. The embed markdown could mirror the current YouTube and Vimeo syntax: ![:figshare](937004)
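
To make the proposed syntax concrete, here is a minimal Python sketch of how such embed tags could be detected in a post. The pattern and the function name `find_embeds` are assumptions for illustration only, not ThinkLab's actual parser, and the YouTube identifier in the sample text is a placeholder.

```python
import re

# Hypothetical pattern for embed tags of the form ![:service](identifier).
EMBED_PATTERN = re.compile(r"!\[:(?P<service>[a-z]+)\]\((?P<identifier>[^)\s]+)\)")


def find_embeds(text):
    """Return a (service, identifier) pair for each embed tag found in text."""
    return [
        (match.group("service"), match.group("identifier"))
        for match in EMBED_PATTERN.finditer(text)
    ]


# "abc123" is a placeholder identifier, not a real video.
post = "A figure: ![:figshare](937004) and a talk: ![:youtube](abc123)"
print(find_embeds(post))
# → [('figshare', '937004'), ('youtube', 'abc123')]
```

Keeping the service name inside the tag means a new host could be supported without changing the syntax.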

Since the content users upload to Figshare should be CC BY or CC0, ThinkLab could always copy it and store it locally if Figshare tries to charge an exorbitant fee.

Hosting content with GitHub

Git is a distributed version control system that stores files and tracks their changes over time. While primarily designed for code, git repositories can contain any type of file, including binaries (an area where the main competing project, Mercurial, struggles). This cheat sheet goes over the commands for initializing and updating a repository.

GitHub is an online host for git repositories with a nice user interface to browse and inspect the contents. I have started hosting our project analyses using GitHub and find that it solves many of the file hosting issues that I was experiencing.

First you must create a repository, add your files, commit, and push to GitHub. For this post, I'll use my erc repository as an example. Below I show how to use GitHub for cloud-based file storage.
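
As a rough sketch of the create, add, commit, and push steps, the Python snippet below builds the corresponding git commands and prints them. It only runs them when `dry_run` is turned off. The remote URL, the commit message, and the function names are assumptions for illustration. It also assumes that an empty repository already exists on GitHub and that git is installed.

```python
import shlex
import subprocess


def publish_commands(remote_url, branch="gh-pages", message="Initial commit"):
    """Build the git commands that create a local repository and push it."""
    return [
        ["git", "init"],
        ["git", "checkout", "-b", branch],
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "push", "--set-upstream", "origin", branch],
    ]


def publish(directory, remote_url, dry_run=True):
    """Print each command; run it inside `directory` unless dry_run is set."""
    for command in publish_commands(remote_url):
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, cwd=directory, check=True)


# Assumed remote URL; a dry run only prints the commands.
publish(".", "https://github.com/dhimmel/erc.git")
```

The dry run is the default so that nothing is committed or pushed by accident.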

Retrieving the most recent version of a file

format: https://raw.githubusercontent.com/[user]/[repo]/[branch]/[path_to_file]
example: https://raw.githubusercontent.com/dhimmel/erc/gh-pages/entrez-group.R

Linking to the most recent version of a file makes sense for a research plan or project summary, where the updated file is desired. For example, you may have a figure that should be updated as you modify your analysis.
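
If you generate these links programmatically, the format above can be filled in with a small helper. This is a minimal Python sketch; the function name `raw_url` is my own choice, and the values are taken from the example above.

```python
from urllib.parse import quote

RAW_HOST = "https://raw.githubusercontent.com"


def raw_url(user, repo, branch, path):
    """Build the raw-content URL for the latest version of a file on a branch."""
    # quote() percent-encodes spaces and similar characters but keeps "/".
    return f"{RAW_HOST}/{user}/{repo}/{branch}/{quote(path)}"


print(raw_url("dhimmel", "erc", "gh-pages", "entrez-group.R"))
# → https://raw.githubusercontent.com/dhimmel/erc/gh-pages/entrez-group.R
```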

Retrieving a specific version of a file

format: https://raw.githubusercontent.com/[user]/[repo]/[commit_hash]/[path_to_file]

By linking to a specific file in a specific commit, you don't have to worry about file updates or deletion interfering with the content of an old post. This makes sense for discussions where posts are often chronological. In these instances, updated external resources could degrade the scientific record.
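
One way to freeze a link is to take the branch-based URL and swap the branch for a commit hash. The Python sketch below does this under the assumption that the branch name contains no slash. The function name `pin_to_commit` and the hash in the example are placeholders, not a real commit of the erc repository.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_to_commit(url, commit_hash):
    """Replace the branch segment of a raw URL with a commit hash."""
    parts = urlsplit(url)
    # Expected path layout: /user/repo/branch/path_to_file
    segments = parts.path.lstrip("/").split("/", 3)
    if len(segments) < 4:
        raise ValueError("expected /user/repo/branch/path_to_file")
    user, repo, _branch, file_path = segments
    pinned_path = "/".join(["", user, repo, commit_hash, file_path])
    return urlunsplit(
        (parts.scheme, parts.netloc, pinned_path, parts.query, parts.fragment)
    )


latest = "https://raw.githubusercontent.com/dhimmel/erc/gh-pages/entrez-group.R"
# Placeholder hash for illustration only.
print(pin_to_commit(latest, "0123456789abcdef0123456789abcdef01234567"))
```

The full hash for a file can be copied from the commit history on GitHub.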

Advantages

The main benefits of this system are tracked changes, versioning, and the coupling of code, data, and results in one place.

Drawbacks

  • individual files are limited to 100 MB
  • requires git expertise, although the GitHub applications for Windows and Mac lower the barrier
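
To check a project against the 100 MB file size limit listed among the drawbacks, a short script can list the files that exceed it before you commit. This is a minimal Python sketch. It assumes "100 MB" means 100 × 1024 × 1024 bytes, and the function name `oversized_files` is hypothetical.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=SIZE_LIMIT_BYTES):
    """Return sorted (path, size) pairs for files under root larger than limit."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's internal storage directory.
        dirnames[:] = [name for name in dirnames if name != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symbolic links, which may be broken
            size = os.path.getsize(path)
            if size > limit:
                found.append((path, size))
    return sorted(found)


for path, size in oversized_files("."):
    print(f"{path}: {size / (1024 * 1024):.1f} MiB")
```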

Note that there is a utility for viewing HTML files from GitHub repositories: http://htmlpreview.github.io/. I found it very useful for sharing reports that are compiled directly in the projects (using R Markdown, for instance).

@alizee, I have been using https://rawgit.com to display webpages corresponding to a past commit, since the gh-pages branch method only displays the current version.

I like the idea of htmlpreview.github.io because it is itself a GitHub page, meaning GitHub remains the only dependency. However, I cannot get the site to work:

See: htmlpreview hangs while rawgit delivers.

@alizee, I have been using https://rawgit.com to display webpages corresponding to a past commit, since the gh-pages branch method only displays the current version.

I'm not sure I understand, since as you said:

https://raw.githubusercontent.com/[user]/[repo]/[commit_hash]/[path_to_file]

...lets you get the file at a certain point in history. So you can preview an HTML file through https://htmlpreview.github.io/?https://raw.githubusercontent.com/[user]/[repo]/[commit]/[path]

I just wanted to add that if you're not creating these pages or links programmatically, the easiest way to get the link to the raw content (e.g., an HTML file) is to navigate to your file or commit on GitHub and click the "Raw" button in the top right corner. You can then copy the link and paste it directly into https://htmlpreview.github.io/, for instance. I found this less error-prone than constructing the link by hand.
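
For those who do create the links programmatically, the preview link is just the raw link appended after `https://htmlpreview.github.io/?`. Here is a minimal Python sketch. The function name `preview_url` is my own, and the user, repository, commit, and file in the example are placeholders rather than real resources.

```python
RAW_PREFIX = "https://raw.githubusercontent.com/"
PREVIEW_PREFIX = "https://htmlpreview.github.io/?"


def preview_url(raw_link):
    """Wrap a raw.githubusercontent.com link in an htmlpreview link."""
    if not raw_link.startswith(RAW_PREFIX):
        raise ValueError("expected a raw.githubusercontent.com link")
    return PREVIEW_PREFIX + raw_link


# Placeholder values for illustration only.
raw_link = RAW_PREFIX + "example-user/example-repo/abc123/report.html"
print(preview_url(raw_link))
# → https://htmlpreview.github.io/?https://raw.githubusercontent.com/example-user/example-repo/abc123/report.html
```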

The raw.githubusercontent.com method can retrieve an HTML file from a previous commit (example). However, although the file is retrieved, it is not displayed as a webpage (at least in Chrome). Therefore, I have been using rawgit.com (example) but may switch to htmlpreview.github.io (example) now that the mixed content bug has been fixed.

 
Cite this as
Daniel Himmelstein, Jesse Spaulding, Antoine Lizee (2015) File hosting advice. Thinklab. doi:10.15363/thinklab.d27
License

Creative Commons License
