The Illinois Data Analytics team has had an eventful couple of years wading through the massive amounts of data generated in our Coursera courses. Although our team includes experienced data scientists and educational experts, it's been both daunting and exciting to find ourselves in the position of "discovering data." This blog post is the first of many in which we share our discoveries, stumbles, and lessons learned as we make our way through the corn maze that is Illinois MOOC data. We hope these posts contribute to the wider discussions about data that have recently emerged in the MOOC universe, and serve as an entry point for educational researchers looking to better understand learning analytics as it relates to their own questions about education.
To lay the groundwork for research, our primary goal has been to define, label, and categorize as much information as we could wrap our heads around. This includes creating a single (huge) merged data set that incorporates all of the streams of data from our session-based courses (clickstream, SQL, and survey data). Thus far we've identified and added descriptions for approximately 200 variables, a number we anticipate will grow as we work with researchers to identify important and interesting points of analysis.
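To give a flavor of what "merging the streams" can look like, here is a minimal sketch in Python. It is not our actual pipeline. It assumes each stream has already been reduced to one row per learner, and that every row carries a shared key. The key name `user_id`, the function name `merge_streams`, and the field names in the example are all hypothetical.

```python
def merge_streams(clickstream, sql_rows, survey_rows, key="user_id"):
    """Combine three per-learner data streams into one record per learner.

    Each argument is a list of dicts with at most one dict per learner
    (an assumption made for this sketch). Fields are prefixed with their
    source name so that same-named columns from different streams do not
    overwrite each other.
    """
    merged = {}
    sources = (
        ("clickstream", clickstream),
        ("sql", sql_rows),
        ("survey", survey_rows),
    )
    for source_name, rows in sources:
        for row in rows:
            # Create the learner's record the first time we see them.
            record = merged.setdefault(row[key], {key: row[key]})
            for field, value in row.items():
                if field != key:
                    record[f"{source_name}_{field}"] = value
    return merged


# Hypothetical example data: learner 2 has no clickstream row.
merged = merge_streams(
    clickstream=[{"user_id": 1, "clicks": 10}],
    sql_rows=[{"user_id": 1, "grade": 0.9}, {"user_id": 2, "grade": 0.5}],
    survey_rows=[{"user_id": 2, "age": 30}],
)
```

A learner who is missing from one stream simply ends up without that stream's fields, which keeps everyone in the merged set rather than dropping partial records.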
Our compiled data set documentation is available to the public. We welcome suggestions for improvement.
Along with documenting our data, we've made a number of decisions about how to determine specific variables, such as location. In an age of increasing global mobility and ever-wider access to transport networks, we quickly realized how difficult it is to capture a person's location. Complicating matters further, an IP address (generally regarded as an accurate means of determining location) becomes unreliable when VPNs are thrown into the mix. As such, we decided to define a participant's "location" as their most frequent IP access point. Although this decision may seem inconsequential when working with smaller data sets or with on-campus course data, it becomes highly relevant when dealing with millions of mobile course participants.
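For the curious, the "most frequent IP" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our production code: it assumes the access data can be read as (user, IP address) pairs, and the function name `most_frequent_ip` and the sample addresses are made up for the example.

```python
from collections import Counter, defaultdict


def most_frequent_ip(access_log):
    """Map each user to the IP address they connected from most often.

    access_log is an iterable of (user_id, ip_address) pairs, one pair
    per recorded access. If two addresses are tied for a user, the one
    seen first for that user is kept.
    """
    counts = defaultdict(Counter)
    for user_id, ip_address in access_log:
        counts[user_id][ip_address] += 1
    return {
        user_id: ip_counts.most_common(1)[0][0]
        for user_id, ip_counts in counts.items()
    }


# Hypothetical access records using documentation-only IP ranges.
log = [
    ("u1", "203.0.113.5"),
    ("u1", "198.51.100.7"),
    ("u1", "203.0.113.5"),
    ("u2", "192.0.2.44"),
]
locations = most_frequent_ip(log)
```

Turning the chosen address into a country or city still requires a separate IP geolocation lookup, which this sketch leaves out.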
As we continue to develop and refine our protocols for collecting and curating the massive amounts of data from our session-based courses, we will soon embark on a new journey of discovering data as we gear up to work with "on-demand" data. On-demand courses present many new methodological and analytical challenges because they are not bound by specific start and end dates. We're still in the early stages of figuring out how to approach this data, but we remain excited about the possibilities of discovering something new and interesting. We'll keep you abreast of our progress, which will likely include a number of failures, but hopefully a few successes as well!