Discussion summary statistics for illustrating project impact

Daniel Himmelstein, Antoine Lizee, Jesse Spaulding

doi:10.15363/thinklab.d191

Daniel Himmelstein March 26, 2016

I'm currently writing up a first draft of our project report. Yay! I currently have a sentence:

In total, XX non-team members commented across the 65 discussions, which generated XX comments totaling XX words and XX characters.

Essentially, I want to summarize how much content, participation, and viewership our project has generated. @jspauld, what's the best way to go about computing these numbers (I can do it manually but is there an easier way)?

Does it make sense to add these summary statistics somewhere on the project page?

And is it possible to get a total count of all views/unique viewers to any page in the project?

Daniel Himmelstein April 1, 2016

One way to proceed would be to create a feature for exporting a Thinklab project. From the export, I could calculate all the stats I want.

For example, I'm imagining a JSON export (or XML) that contains:

the markdown content for discussions and documents
the HTML-formatted content for discussions and documents
user info for the subset of users who contributed, i.e. the project leaderboard

Potentially, the export could contain multiple files that get zipped into a single download. Then the export could include figures and potentially each discussion would be in its own file in a discussion directory.

Antoine Lizee April 5, 2016

That would be extremely useful indeed, and I side with Daniel on getting raw data.

From an engineering point-of-view - because it makes explaining easier - it would be a good start in my opinion to get JSON or tabular format of a subset of fields for the 'comments', 'discussions', and 'users' collections.
Scoping per project would be great. HTML-formatted content seems unnecessary for descriptive stats and potentially harder to get, and even the actual contents for each object could be skipped to start with. Date, user, number of views, and the few foreign and primary keys would be a great start!

Jesse Spaulding April 7, 2016

I set up a project export. It is available here: https://think-lab.github.io/p/rephetio/export. Note that the number of views is a calculated field, so I prefer not to add it to the export for now.

It currently contains the following tables and fields for each project:

documents — title, intro_md, intro_html, body_md, body_html, doc_published, topic_field
threads — profile, document, subject, published, topic_field, doi_field
comments — profile, thread, body_md, body_html, published
notes — profile, comment, body_md, body_html, added
profiles — username, first_name, last_name

Daniel Himmelstein: In threads, it seems doi_field refers to whether the discussion is regarding a specific paper. However, it currently only holds 3 values — {'', '10.1093/nar/gkv1075', None} — so it looks like there is a bug. Also can we get another column with the DOI for the discussion?
Jesse Spaulding: Yes, there was an issue. It has been corrected now. DOI column added.
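
As a rough illustration of how the requested summary statistics could be computed from such an export, here is a minimal Python sketch. The layout of the export (a dict mapping each table name to a list of records), the sample records, and the team set are all assumptions made for illustration; only the table and field names come from the list above.

```python
# Hypothetical miniature export. The real file's layout is an assumption here:
# a dict mapping each table name to a list of records.
export = {
    "profiles": [
        {"username": "alice"},
        {"username": "bob"},
        {"username": "carol"},
    ],
    "threads": [
        {"profile": "alice", "subject": "First thread"},
        {"profile": "bob", "subject": "Second thread"},
    ],
    "comments": [
        {"profile": "alice", "thread": 0, "body_md": "Opening post with five words"},
        {"profile": "bob", "thread": 0, "body_md": "A short reply"},
        {"profile": "carol", "thread": 1, "body_md": "Another reply here"},
    ],
}

team = {"alice"}  # assumed set of team member usernames


def summarize(export, team):
    """Count discussions, comments, words, characters, and non-team commenters."""
    bodies = [comment["body_md"] for comment in export["comments"]]
    commenters = {comment["profile"] for comment in export["comments"]}
    return {
        "discussions": len(export["threads"]),
        "comments": len(bodies),
        "words": sum(len(body.split()) for body in bodies),
        "characters": sum(len(body) for body in bodies),
        "non_team_commenters": len(commenters - team),
    }


stats = summarize(export, team)
print(stats)
```

Words are counted by splitting on whitespace, which is only an approximation for markdown content.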

Daniel Himmelstein April 8, 2016

Fantastic — can't wait to play with the content.

I'm having a bit of trouble reading the file. The two methods that I usually use to download and load a JSON from Python are not working (JSONDecodeError):

url = 'https://think-lab.github.io/p/rephetio/export'

# Method 1 (Python 3)
import json
import urllib.request
urllib.request.urlretrieve(url, 'export')  # save the download as ./export
with open('export') as fp:
    export = json.load(fp)

# Method 2 (requests dependency)
import requests
export = json.loads(requests.get(url).text)

I think it's because they are retrieving an HTTPMessage rather than a JSON text file. Also, when I view the project export in my browser, I'm noticing some escaping that I don't think is standard JSON. Here's an example URL that does work https://pages.github.com/versions.json. @jspauld any ideas?

Two other points: if this is a JSON file, then adding a .json extension may be helpful. Second, I think newlines make working with JSON files easier. Setting indent=2 in Python's json.dump will enable newlines. Newlines often make the difference between a frozen and a responsive text editor.
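
A small sketch of the effect of the indent setting, using a made-up record rather than the real export:

```python
import json

# Made-up record, used only to compare the two output styles.
record = {"threads": [{"subject": "Example"}], "profiles": []}

compact = json.dumps(record)             # everything on a single line
indented = json.dumps(record, indent=2)  # one key or item per line

print(compact)
print(indented)

# Both forms parse back to the same data; only the whitespace differs.
same_data = json.loads(compact) == json.loads(indented)
```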

Jesse Spaulding April 8, 2016

Yes, I thought something looked off. I was serializing things twice. Try again with:

https://think-lab.github.io/p/rephetio/export.json
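
For readers unfamiliar with the symptom, here is a minimal sketch of what serializing twice looks like in Python. It is not Thinklab's server code, only an illustration with a made-up record: the second pass encodes the JSON text as a JSON string, which produces the backslash-escaped quotes, and decoding once returns a string rather than a dict.

```python
import json

data = {"subject": "Example thread"}  # made-up record

once = json.dumps(data)    # normal JSON text
twice = json.dumps(once)   # serializing the already-serialized text

print(twice)  # the inner quotes now appear backslash-escaped

decoded = json.loads(twice)
print(type(decoded).__name__)  # a single decode yields a str, not a dict

recovered = json.loads(decoded)  # a second decode recovers the original data
```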

Antoine Lizee April 8, 2016

Works for me - awesome!

We were thinking of potentially applying the simple analytics we're doing to the whole Thinklab dataset. Would there be a way to get the data for all projects?

Also, even if calculated, hence dynamic and potentially unreliable, the 'views' field would be a great addition for marketing purposes of our projects. We can issue a disclaimer next to the related data.

Best,
Antoine

Daniel Himmelstein: I agree the visitor counts would be great. I'm already learning from our project's discussion view counts. And if we set up an automatic export every week, we will be able to do fascinating analytics of interest in subjects over time.

Daniel Himmelstein April 8, 2016

Automatically retrieving the Thinklab JSON export

Here's a Python script to retrieve, parse, and save the JSON export. You need to define the email and password variables for your Thinklab account.

import json
import datetime
import requests

with requests.Session() as session:
    login_url = 'https://think-lab.github.io/login'
    session.get(login_url)
    csrf_token = session.cookies['csrftoken']
    payload = {
        'email': email,
        'password': password,
        'csrfmiddlewaretoken': csrf_token,
    }
    session.post(login_url, data=payload)

    export_url = 'https://think-lab.github.io/p/rephetio/export.json'
    response = session.get(export_url)
    export = response.json()

export['retrieved'] = datetime.datetime.utcnow().isoformat() + 'Z'

with open('export.json', 'wt') as write_file:
    json.dump(export, write_file, ensure_ascii=False, indent=2, sort_keys=True)

I added a retrieved property with the date/time the export was retrieved. I'd like to acknowledge these two Stack Overflow answers for help with logging in and CSRF tokens in requests.

Daniel Himmelstein April 8, 2016

A repository for project analytics

I created a repository (thinklytics) for exporting and analyzing Thinklab content.

@alizee, I think you want data for all projects and proposals using my export.py. The only missing piece to programmatic retrieval of all content is a complete list of proposal and project IDs. For now this can be compiled manually, but it would be nice to have a think-lab.github.io/projects.json for getting this list. Consider forking thinklytics.

Antoine Lizee April 9, 2016

Hi Dan - thanks for the script & the repo.

I updated the code and the syntax of the command line. Notable changes are (i) specification of the directory instead of the path of the exported file, (ii) specification of username and password as arguments for enhanced security, (iii) automatic scraping of all projects and proposals with the 'all' keyword, and (iv) adapted defaults.

I realized later on that (ii) was kind of pointless since HTTPS is not enabled for Thinklab. @jspauld, do you have any plans for that?

Antoine Lizee April 10, 2016

First results

I finished the first pass of the analytics using R, resulting in a few graphs.

I had to reorganize the repo a little, so the export script is now in a subdirectory.

Project imbalance

A few projects clearly see more activity than others. Proposals seem to generate fewer posts (comments, threads, and notes).

Number of posts per project

Evolution of contribution

The contributions have been quite irregular over time but are on the rise. 2016 has seen a clear diversification of projects in addition to the increase in activity, with 6-7 projects showing significant contributions.

Evolution of post creations

Counting the number of characters generated in comments and notes, we see a slightly different story where some content-heavy projects are overrepresented. Also, reducing the 'bandwidth' of our temporal analysis, we can see finer details of the activity, which underlines its irregularity, especially during 2015.

Evolution of character generation

Individual contributions

Finally, we can see that beyond @dhimmel and @jspauld, we have a few "power users" who have joined the platform and contributed at different stages. Thinklab currently hosts 44 contributors who have written more than 500 characters, 18 of whom (highlighted below) have written more than 4,000.

Contribution of individual profiles
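
The analysis above was done in R. For illustration only, the per-user character totals and the two thresholds could be sketched in Python as follows. The records are made up, and it is assumed that each comment and note carries the profile and body_md fields from the export.

```python
from collections import Counter

# Made-up records; only the profile and body_md fields are used.
comments = [
    {"profile": "alice", "body_md": "x" * 3000},
    {"profile": "alice", "body_md": "x" * 1500},
    {"profile": "bob", "body_md": "x" * 600},
    {"profile": "carol", "body_md": "x" * 100},
]
notes = [
    {"profile": "carol", "body_md": "x" * 50},
]

# Total characters written per user across comments and notes.
characters = Counter()
for post in comments + notes:
    characters[post["profile"]] += len(post["body_md"])

# Users above each threshold used in the text (500 and 4,000 characters).
over_500 = sorted(user for user, n in characters.items() if n > 500)
over_4000 = sorted(user for user, n in characters.items() if n > 4000)

print(over_500, over_4000)
```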

Limitations

These results should be taken with a grain of salt, as we face a few limitations. First, we have static information about our objects, with no tracking of edits. As a result, we are underestimating activity in general, and for proposals in particular. Second, I am not sure that comment and note objects cover all of the material that is generated on Thinklab. There might be other object types that we don't have access to, related to the intros and reports of projects, and to grant proposals.

Jesse Spaulding April 11, 2016

@alizee Cool charts!

There is now a new version of the export. It's slightly different as we're now using Django REST framework. I've included the view count and DOI for documents and threads. Note that my concerns regarding the views being a "calculated field" had to do with the performance impact of having to calculate the value for all threads on each export. However, for now it's not a problem.

Daniel Himmelstein: I like that the new version removed an unnecessary nesting. However, the primary keys for each item are now missing. Can these be added to an identifier key?

Antoine Lizee April 12, 2016

Thanks @jspauld.
We haven't tried it yet, but the REST format for the JSON should be fine, as long as the results themselves have not changed.

We made some changes to thinklytics again, mainly an overhaul of the pictures - changes are reflected in the previous post.

We also created a 'singleProject' script to create project-specific analytics, as we did for rephetio.

Antoine Lizee April 13, 2016

As a result of @larsjuhljensen's comment on Twitter, we ran into this image, which is a slightly different version of the stream chart we already had, but with a more appealing flow-like rendering. We reproduced it:

User activity streamchart

Daniel Himmelstein Aug. 7, 2016

Updated Project Rephetio contribution plot

I updated @alizee's user contribution plot above with the latest data and some small modifications (notebook):

Contributions per user

The plot shows cumulative contribution per user over time. Contribution is measured as the square root of the total characters in comments and notes by a user in Project Rephetio up to a given date. The data is smoothed so it's easier on the eyes (less jagged). Unlike @alizee's plot, all contributors are included (no 500 character minimum). The names of users who contributed over 4,000 characters are noted.
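
The measure described above can be sketched without the plotting or the smoothing step, which are omitted here. The posts below are made up, and the (date, user, characters) tuple format is an assumption for illustration rather than the notebook's actual data structure.

```python
import math
from collections import defaultdict
from itertools import accumulate

# Made-up posts as (ISO date, username, number of characters).
posts = [
    ("2016-01-05", "alice", 400),
    ("2016-02-10", "bob", 100),
    ("2016-03-01", "alice", 500),
]


def cumulative_contribution(posts):
    """Map each user to (date, sqrt of cumulative characters) points."""
    by_user = defaultdict(list)
    # ISO dates sort chronologically as plain strings.
    for date, user, n_chars in sorted(posts):
        by_user[user].append((date, n_chars))
    curves = {}
    for user, entries in by_user.items():
        totals = accumulate(n_chars for _, n_chars in entries)
        curves[user] = [
            (date, math.sqrt(total))
            for (date, _), total in zip(entries, totals)
        ]
    return curves


curves = cumulative_contribution(posts)
print(curves)
```

The square root compresses the range, so heavy contributors do not flatten everyone else's curve.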

Daniel Himmelstein Aug. 8, 2016

Error retrieving the JSON export

I'm configuring dhimmel/thinklytics to use continuous integration with a set environment managed by Docker [1]. Unfortunately, I've been getting an error exporting the rephetio project to JSON. The request for https://think-lab.github.io/p/rephetio/export.json returns an HTML page rather than JSON. Highlights of the HTML response include (see full log):

Host Error
What happened? The web server reported a bad gateway error.
What can I do? Please try again in a few minutes.

I've verified that this happens locally as well as on Travis CI, although its occurrence isn't guaranteed. I wonder if the error is caused by the larger size of Project Rephetio compared to other projects. @jspauld can you look into this?

Jesse Spaulding: Hm, it's still working in the browser for me. Is it for you? Is the problem intermittent? Also, keep in mind the export requires a logged in user.
Daniel Himmelstein: The problem is intermittent, but when exporting all projects, the chance of not having any failures is low. The issue affects many projects not just rephetio: i.e. several projects can export properly before an arbitrary project fails. The error has happened to me in the browser (see screenshot), on Travis, and locally. However, on occasion all projects will export successfully on Travis (example) and locally. Here's the code for automatically exporting a project, which creates a session and logs in.
Daniel Himmelstein: I modified our exporter to continually retry failing exports until they succeed. So now our continuous integrations have begun to succeed, although they must first endure several failures (example).
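
A minimal sketch of a retry wrapper of this kind follows. It is not the thinklytics code: the retry and flaky_export names are hypothetical, the flaky function stands in for a real HTTP request, and unlike the approach described above this version gives up after a fixed number of attempts. It relies on the fact that parsing a non-JSON response with response.json() in requests raises a ValueError (or a subclass of it).

```python
import time


def retry(fetch, max_attempts=5, delay=0.0):
    """Call fetch() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ValueError:
            # An HTML error page fails JSON parsing with a ValueError.
            if attempt == max_attempts:
                raise
            time.sleep(delay)


# Hypothetical stand-in for a flaky export: fails twice, then succeeds.
attempts = []


def flaky_export():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("response was not JSON")
    return {"threads": []}


result = retry(flaky_export)
print(len(attempts), result)
```

In practice a nonzero delay between attempts is kinder to the server than retrying immediately.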