Riordan et al. 2020

An empirical investigation of neural methods for content scoring of science explanations

NGSS (Next Generation Science Standards) dimensions

  • DCI (disciplinary core ideas)
  • CCC (crosscutting concepts)
  • SEP (science and engineering practices)

KI (knowledge integration) rubric:

  • involves a process of building on and strengthening science understanding by incorporating new ideas and sorting out alternative perspectives using evidence

  • rewards students for connecting evidence to claims in their explanations

Data

Constructed response (CR) items are evaluated. The items chosen are ones where SEPs must be used while demonstrating understanding of CCCs and DCIs.

CR Items:

  • Musical Instruments and the Physics of Sound Waves (MI)
  • Photosynthesis and Cellular Respiration (PS)
  • Solar Ovens (SO)
  • Thermodynamics Challenge (TC)

Two separate rubrics in parallel:

  • KI rubric
    • linkage with subsets of the ideas described in the evidence statements
      • Photosynthesis (PS) listed 5 ideas related to energy and matter changes during photosynthesis
    • Scores from 1-5
  • NGSS subscore rubric
    • two of three dimensions for each CR
      • Only those that are relevant given the prompt are used (e.g. a question where the answer doesn't depend upon science and engineering practices would not have a score for that dimension)
    • Scores from 1-3

The Thermodynamics Challenge (TC) item was particularly challenging.

For some items, less annotated data was available for the NGSS dimension models than for the KI models.

Models

A separate model was trained for each item and score type. 10-fold cross-validation with train/val/test splits, evaluating on the concatenated predictions across folds.
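A minimal sketch of what this evaluation protocol might look like, assuming scikit-learn; `train_model` and `predict` are hypothetical placeholders for whichever model is being scored, and the validation-split size is my assumption. Quadratic weighted kappa (QWK), the agreement metric the results report, is computed once on the concatenated predictions:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

def cross_validate(texts, scores, train_model, predict, n_folds=10, seed=0):
    """Collect each fold's test predictions, then score them all at once."""
    texts, scores = np.asarray(texts), np.asarray(scores)
    all_preds = np.empty(len(scores))
    for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(texts):
        # Carve a validation split out of the training portion
        # (used for early stopping / model selection).
        n_val = len(train_idx) // 9
        val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]
        model = train_model(texts[fit_idx], scores[fit_idx],
                            texts[val_idx], scores[val_idx])
        all_preds[test_idx] = predict(model, texts[test_idx])
    # Quadratic weighted kappa on the concatenated predictions across folds.
    qwk = cohen_kappa_score(scores.astype(int),
                            np.rint(all_preds).astype(int),
                            weights="quadratic")
    return qwk, all_preds
```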

SVR (support vector regression)

  • binary word unigrams and bigrams (see the sketch below)
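As a rough sketch of this baseline in scikit-learn; the toy data and the linear kernel are illustrative assumptions, not details from the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Toy stand-in data; the real inputs are student responses with rubric scores.
train_texts = ["light energy is transformed into chemical energy",
               "the plant makes food from sunlight"]
train_scores = [4, 2]

# binary=True gives presence/absence of word unigrams and bigrams.
svr = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),
    SVR(kernel="linear"),
)
svr.fit(train_texts, train_scores)
print(svr.predict(["energy is transformed"]))
```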

RNN

  • pretrained word embeddings (GloVe 100) fed into a bidirectional GRU encoder (see the sketch after this list)
  • Hidden states of the GRU are pooled (max)
  • Encoder output aggregated in a fully-connected feedforward layer using a sigmoid activation (giving a scalar score).
  • Presumably the same score scaling/unscaling we worked with before is happening, since the sigmoid squashes everything into (0, 1)
  • exponential moving average of the weights used during training
  • 50 epochs
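A minimal PyTorch sketch of the architecture described above; the hidden size, the score range in `unscale`, and the EMA decay are assumptions rather than values from the paper:

```python
import torch
import torch.nn as nn

class GRUScorer(nn.Module):
    """GloVe-100 embeddings -> BiGRU -> max-pool over time ->
    feedforward layer with sigmoid -> scalar in (0, 1)."""

    def __init__(self, glove_weights, hidden_size=128):  # hidden size is an assumption
        super().__init__()
        # glove_weights: (vocab_size, 100) tensor of pretrained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(glove_weights.size(1), hidden_size,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, token_ids):                      # (batch, seq_len)
        states, _ = self.gru(self.embed(token_ids))    # (batch, seq_len, 2*hidden)
        pooled, _ = states.max(dim=1)                  # max-pool the hidden states
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

def unscale(pred, lo=1.0, hi=5.0):
    # Map sigmoid output back to the rubric range (e.g. 1-5 for KI scores);
    # targets would be min-max scaled into (0, 1) the same way for training.
    return lo + pred * (hi - lo)

def ema_update(ema_params, model, decay=0.999):  # decay is an assumption
    # Exponential moving average of the weights, updated after each step;
    # initialize ema_params as detached clones of model.parameters().
    with torch.no_grad():
        for ema_p, p in zip(ema_params, model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1 - decay)
```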

Pretrained transformer

  • bert-base-uncased
  • the [CLS] token output is fed through a non-linear layer to obtain the scalar score (see the sketch after this list)
  • exponential moving average of the weights used during training
  • 20 epochs
  • When tuning hyperparameters, for each fold the epoch with the highest validation performance is used for evaluation.
  • During final training, validation and training data are concatenated and then the model is retrained.
    • I assume this is done for all the models but it's only mentioned for the PT model
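A sketch of the transformer scorer with Hugging Face transformers; the tanh head and the sigmoid on the output are my assumptions about what the "non-linear layer" looks like:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertScorer(nn.Module):
    """[CLS] final hidden state -> non-linear layer -> scalar score."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        # The exact head is an assumption; the paper says "non-linear layer".
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] token representation
        return torch.sigmoid(self.head(cls)).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["heat flows from the warm air into the ice"],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    score = BertScorer()(batch["input_ids"], batch["attention_mask"])
```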

Results

KI models

The pretrained transformer models are more robust: they are always ahead of the RNN on all metrics (though sometimes not by much).

The items with highly skewed score distributions showed lower levels of human-machine agreement (below the 0.7 QWK threshold used in real-world scoring applications). Where does that threshold come from??