In my research on recognizing children's understanding of science concepts, I led the development of an annotated corpus of elementary students' responses to assessment questions.
We acquired grade 3-6 responses to 287 questions from the Assessing Science Knowledge (ASK) project (Lawrence Hall of Science, 2006). The responses, which range in length from moderately short verb phrases to several sentences, cover sixteen diverse teaching and learning modules spanning life science, physical science, earth and space science, scientific reasoning, and technology. We generated the corpus by transcribing a random sample (approximately 15,400) of the students' handwritten responses.
The ASK assessments included a reference answer for each constructed-response question. We decomposed these reference answers into fine-grained facets and annotated each facet according to the student's apparent understanding of that facet. Please see my Publications page for more detail regarding the corpus; in particular, see:
Rodney D. Nielsen, Wayne Ward, James H. Martin and Martha Palmer. (2008). Annotating Students' Understanding of Science Concepts. In Proceedings of the Sixth International Language Resources and Evaluation Conference, (LREC'08), Marrakech, Morocco, May 28-30, 2008. Published by the European Language Resources Association, (ELRA), Paris, France.
In our research on identifying domain-general features for sarcasm detection, we developed a dataset of sarcastic and non-sarcastic tweets. To do so, we downloaded tweets containing any of the following trailing hashtags during February and March 2016: #sarcasm, #happiness, #sadness, #anger, #fear, #disgust, and #surprise. We labeled the #sarcasm tweets as sarcastic, and the tweets containing the other six hashtags (corresponding to Paul Ekman's six basic emotions) as non-sarcastic. We chose the non-sarcastic hashtags because their associated tweets were still expected to express opinions, as sarcastic tweets do, but in a non-sarcastic way. Note that this almost certainly increases the difficulty of discriminating between sarcastic and non-sarcastic tweets, since both are emotionally charged (see González-Ibáñez, Muresan, and Wacholder (2011) and Ghosh, Guo, and Muresan (2015) for some interesting research regarding this). However, since distinguishing between literal and sarcastic sentiment is useful for real-world applications of sarcasm detection, we consider the presence of sentiment in our dataset to be a worthwhile challenge.
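The hashtag-based labeling rule described above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the function names, the regular expression used to find trailing hashtags, and the choice to let #sarcasm take priority when several hashtags appear together are all assumptions.

```python
import re

# Hashtags named in the dataset description.
SARCASM_TAG = "#sarcasm"
EMOTION_TAGS = {"#happiness", "#sadness", "#anger", "#fear", "#disgust", "#surprise"}

# Assumption: "trailing" means the run of hashtags at the very end of the tweet.
TRAILING_TAGS = re.compile(r"(?:\s*#\w+)+\s*$")


def trailing_hashtags(text):
    """Return the lowercased hashtags that end the tweet, in order."""
    match = TRAILING_TAGS.search(text)
    if match is None:
        return []
    return [tag.lower() for tag in re.findall(r"#\w+", match.group(0))]


def label_tweet(text):
    """Label a tweet from its trailing hashtags, or return None if none apply."""
    tags = set(trailing_hashtags(text))
    if SARCASM_TAG in tags:
        # Assumption: #sarcasm wins if it co-occurs with an emotion hashtag.
        return "sarcastic"
    if tags & EMOTION_TAGS:
        return "non-sarcastic"
    return None
```

For example, `label_tweet("Great, another Monday #sarcasm")` yields a sarcastic label, while a tweet that mentions #sarcasm only mid-sentence is left unlabeled.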
The tweet IDs for tweets from the dataset that were still publicly available at the time this was posted are provided below, divided into the training and test sets we used in our work (reference below). For some of the original tweets that were no longer available, we were able to find identical publicly available retweets, so we include the IDs for those retweets as well. For your convenience, we also provide a script for downloading the tweets here.
For much more information about our work on sarcasm detection, refer to the paper below. Please also cite this paper if you use the tweets from this dataset in your own research.
Natalie Parde and Rodney D. Nielsen. #SarcasmDetection is soooo general! Towards a Domain-Independent Approach for Detecting Sarcasm. In the Proceedings of the 30th International FLAIRS Conference. Marco Island, Florida, May 22-24, 2017.
In our research on developing a feature-based regression approach to automatic label aggregation, we developed a dataset of 3,112 word pairs. Each pair was scored for metaphor novelty, in the context of its surrounding sentence, by five crowd workers (we used Amazon Mechanical Turk to crowdsource the labels) and one expert labeler. Labels ranged from 0 (not metaphoric) to 3 (highly novel metaphor). This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
The goal of the work was to aggregate the five crowdsourced labels for each instance into a single continuous prediction close to the expert's gold standard label. We did so by extracting features based on label distribution and presumed worker trustworthiness and using them to train a random subspace regression model; the full source code for our project can be found here.
We randomly divided the data collected into training, validation, and test instances, and provide those subsets below, with features already extracted. The top row of each file contains column (feature) labels, and those labels correspond to the following (refer to the paper cited later for more details):
- ID: The unique identifier for each instance, composed of the number of the sentence from which the instance was extracted, the words in the pair, and their respective sentence positions.
- A1, A2, A3, A4, A5: ANNOTATIONS, ordered by label value.
- A1xR, A2xR, A3xR, A4xR, A5xR: ANNOTATIONS, ordered by the annotators' average r values.
- A1_R, A2_R, A3_R, A4_R, A5_R: AVG. R, ordered by label value.
- A1xR_R, A2xR_R, A3xR_R, A4xR_R, A5xR_R: AVG. R, ordered by the annotators' average r values.
- Avg_Annotation: AVG.
- Weighted_Avg_Annotation: WEIGHTED AVG.
- Weighted_Avg_Annotation_Good: WEIGHTED AVG. (GOOD).
- Weighted_R_HIT: HIT R.
- True_Label: The gold standard label for the instance.
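As a rough illustration of working with these columns, the sketch below reads a file with a header row and measures how far the plain average of the crowd labels falls from the gold standard. It makes several assumptions: that the files are comma-separated, that the Avg_Annotation and True_Label values are numeric, and that the two sample rows (including their IDs and the reduced set of columns) are invented placeholders rather than real data. The helper names are hypothetical.

```python
import csv
import io


def load_instances(handle, delimiter=","):
    """Read rows keyed by the column labels in the top row of the file."""
    # Assumption: comma-separated; pass another delimiter if the files differ.
    return list(csv.DictReader(handle, delimiter=delimiter))


def mean_absolute_error(rows, predicted="Avg_Annotation", gold="True_Label"):
    """Average absolute gap between one feature column and the gold label."""
    errors = [abs(float(row[predicted]) - float(row[gold])) for row in rows]
    return sum(errors) / len(errors)


# Invented sample with a subset of the columns, for illustration only.
SAMPLE = """ID,A1,A2,A3,A4,A5,Avg_Annotation,True_Label
example-1,0,0,1,1,3,1.0,1
example-2,2,2,3,3,3,2.6,3
"""

rows = load_instances(io.StringIO(SAMPLE))
mae = mean_absolute_error(rows)
print(f"Baseline mean absolute error: {mae:.2f}")
```

With a real file, `load_instances(open(path, newline=""))` would replace the in-memory sample. The simple average serves only as a baseline against which a learned aggregation could be compared.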
We also provide a copy of the data containing only the crowdsourced and expert labels for each instance (no features):
For more information about our work on automatic label aggregation, please refer to the paper below. Please also cite this paper if you use this dataset in your own research.
Natalie Parde and Rodney D. Nielsen. Finding Patterns in Noisy Crowds: Regression-based Annotation Aggregation for Crowdsourced Data. To appear in the Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). Copenhagen, Denmark, September 7-11, 2017.