CLINGDINGS
- How do you determine the worth of a language?
- November 6th 2019: Hai, Peng
- Allen Riddell
- Hai Hu 02-19-2020
- Zeeshan 02-19-2020
How do you determine the worth of a language?
Arle Lommel, October 30 2019
2019 is the UN Year of Indigenous Languages.
- Highly idealized language ideals:
- Everyone should be able to use their own language as they see fit.
- Obviously this isn't exactly how things work in practice.
Every language has an intrinsic value
However, in a world of limited resources, not all 7,000 of the world's languages can be invested in.
Less than 1% of content is translated into another language.
Covering 100% of content would take about 20 million translators, and that is only for one additional language.
To cover all 135 economically important languages, we would need 2 billion translators.
Let's look at other views of value
The number of speakers does not determine the value of a language for a business.
- Though this may be exactly the information that is relevant to NGOs or religious organizations.
Maybe we can look at GDP?
- Could look at GDP per capita to get a sense of the wealth of individual speakers (note: I'm not sure why wealth per individual is relevant here).
Internet adoption rate
- More relevant to today's globally connected tech powered companies.
Pre-2019, CSA selected 50(?) languages with online relevance (e.g. usage by communities online).
Calculated the number of speakers for each country/territory.
Used a zero-sum approach (no accounting for multilingualism).
Assigned languages to four tiers based on cumulative market research.
This measure is called eGDP (electronic GDP); it is not a measure of e-commerce.
After 2019:
- Added multilingualism.
- Expanded from 300 locales to 500 locales.
- Added a model of income inequality to help scale GDP (e.g. if 12% of a country's population is online, those people probably have higher income than the population-wide average); see the sketch below.
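A minimal sketch of how an eGDP-style weight might be computed. All figures, the inequality multiplier, and the speaker shares are invented for illustration; this is the general idea, not CSA's actual formula:

```python
# Hypothetical eGDP-style weight for one language (invented numbers).
# Idea: sum each country's GDP, scaled by the share of the population that
# is online, an inequality multiplier (online users tend to be richer than
# the national average), and the share of that country's speakers assigned
# to the language (zero-sum assignment across languages).
countries = [
    # (name, GDP in $bn, online share, inequality multiplier, speaker share)
    ("CountryA", 1500.0, 0.12, 1.8, 0.60),
    ("CountryB", 300.0, 0.70, 1.1, 0.25),
]

egdp = sum(gdp * online * adjust * speakers
           for _name, gdp, online, adjust, speakers in countries)
print(f"eGDP-style weight: ${egdp:.1f}bn")  # roughly $252bn with these toy numbers
```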
November 6th 2019: Hai, Peng
Product harm report evaluation
Product harm crises occur when products cause incidents, and the resulting public response produces negative publicity for the company and for the government body in charge of regulation.
Two issues with issuing a recall:
- Delayed announcement of the recall
- Food recalls take an average of 57 days after discovery
- Automotive recalls take an average of 306 days (US)
- Low recall completion (only a small proportion of products that should be recalled actually are)
Legal wiggle room
- Companies have to explain how the issue was discovered
- And what steps were taken to determine whether a recall should be done
- But they are free to determine how they release the information and how much information they release
Research questions
Do recall communication examples differ across industries?
Hypotheses
- The longer the recall takes, the worse the company is viewed.
- The more steps taken, the more favorably the company is viewed.
The idea is that these factors shape the way the company frames its response.
- The model needs to account for year effects, firm effects, etc.
- The dependent variables are the linguistic measures.
- The independent variables are the number of steps taken by the company and the time taken to report (see the sketch after this list).
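A minimal sketch of the kind of regression described above, using the statsmodels formula API; all column names and values are invented stand-ins, not the presenters' data:

```python
# Toy fixed-effects regression: a linguistic measure of the recall text
# regressed on remedial steps and reporting delay, with year and firm
# fixed effects (C(...) creates dummy variables).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "lexical_complexity": [0.61, 0.55, 0.70, 0.58, 0.64, 0.52, 0.67, 0.59],
    "num_steps":          [3, 1, 5, 2, 4, 1, 5, 2],
    "report_delay_days":  [57, 306, 40, 120, 80, 210, 35, 150],
    "year":               [2017, 2017, 2018, 2018, 2019, 2019, 2017, 2018],
    "firm":               ["A", "B", "A", "B", "A", "B", "B", "A"],
})

model = smf.ols(
    "lexical_complexity ~ num_steps + report_delay_days + C(year) + C(firm)",
    data=df,
).fit()
print(model.summary())
```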
Argument structure was crucial in previous research, in addition to subjectivity measures.
A difference emerges in the number of content words (nouns, verbs, adjectives, and adverbs).
Lexical complexity (word-level measures; sketched in code below):
- MATTR (moving-average TTR)
- STTR: mean TTR over successive 100-word chunks
- CTTR (corrected TTR): types / sqrt(2 × tokens)
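Sketch implementations of the three lexical measures (the MATTR window size below is an assumption):

```python
import math

def ttr(tokens):
    """Plain type-token ratio: distinct types over total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every overlapping window."""
    if len(tokens) <= window:
        return ttr(tokens)
    n = len(tokens) - window + 1
    return sum(ttr(tokens[i:i + window]) for i in range(n)) / n

def sttr(tokens, chunk=100):
    """Standardized TTR: mean TTR over successive non-overlapping
    100-word chunks (an incomplete final chunk is dropped)."""
    chunks = [tokens[i:i + chunk]
              for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not chunks:  # text shorter than one chunk
        return ttr(tokens)
    return sum(ttr(c) for c in chunks) / len(chunks)

def cttr(tokens):
    """Corrected TTR: types / sqrt(2 * tokens)."""
    return len(set(tokens)) / math.sqrt(2 * len(tokens))

words = "the quick brown fox jumps over the lazy dog the end".split()
print(ttr(words), mattr(words, window=5), cttr(words))
```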
Structural complexity: T-unit length plus dependency length (a T-unit is a main clause together with any subordinate clauses attached to it).
The reading ease score takes into account the number of syllables per word and the number of words per sentence (sketched below).
However, the number of syllables per word is hard to calculate reliably.
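A sketch of the standard Flesch reading ease formula; the vowel-group syllable counter is a deliberately crude heuristic, which is exactly the reliability problem the note above points to:

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowel letters (rough for English:
    silent 'e' and diphthongs are miscounted)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The company recalled the product. Customers were notified."))
```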
Allen Riddell
We know that a large number of books were published during the Victorian era (about 25,000), and we have a lot of information about author gender and year-level statistics.
No existing corpus effectively reflects the population of novels published during this period.
The Chadwyck-Healey corpus is particularly bad: 50% of its data comes from male authors published before 1876, even though such works were only 15% of the population.
Random sampling of the population is not really possible because we don't have a complete database of all novels published during the Victorian era.
Instead, we do quota sampling.
We divide up the population into categories based on year and gender and manually encode a randomly selected chapter.
- Not a representative sample
- Overrepresents authors who wrote more than one novel
- Overrepresents novels published in multiple volumes
Maybe there's a bias in what got published, or certain genres tended toward multi-volume publication.
The solution is to use post-stratification as a way to analyze granular distinctions after the fact (see the sketch after this list):
- e.g. novels published by women in 1840
- novels involving trains
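A minimal sketch of the post-stratification step, assuming invented population shares and per-stratum sample means (gender × decade strata):

```python
# Post-stratification: combine per-stratum sample estimates using the
# KNOWN population share of each stratum, so the quota sample's
# overrepresentation of some strata no longer distorts the estimate.
# All numbers are invented for illustration.
population_share = {
    ("female", "1840s"): 0.20,
    ("male",   "1840s"): 0.15,
    ("female", "1870s"): 0.35,
    ("male",   "1870s"): 0.30,
}
sample_mean = {  # mean of some text measure within each stratum's sample
    ("female", "1840s"): 0.62,
    ("male",   "1840s"): 0.55,
    ("female", "1870s"): 0.70,
    ("male",   "1870s"): 0.58,
}

estimate = sum(population_share[s] * sample_mean[s] for s in population_share)
print(f"Post-stratified population estimate: {estimate:.3f}")
```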
Hai Hu 02-19-2020
Building a natural language inference dataset in Chinese
What is NLI?
Determining whether a hypothesis is entailed by, contradicts, or is neutral toward a premise.
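A toy instance to make the task concrete (examples invented):

```python
# One premise, one hypothesis per label.
premise = "A man is playing guitar on stage."
examples = [
    (premise, "A man is performing music.",    "entailment"),
    (premise, "The man is a famous musician.", "neutral"),
    (premise, "A man is sleeping backstage.",  "contradiction"),
]
for p, h, label in examples:
    print(f"P: {p}\nH: {h}\n-> {label}\n")
```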
Issues with SNLI
Turkers do not want contradiction to go both ways.
Bias in hypotheses
If you train on SNLI using just the hypotheses, you do better than the majority baseline.
There's bias in the hypotheses: for example, "sleeps" contradicts almost any other action. Additional heuristics, probably introduced by the Turkers, likely exist in the dataset. Creating synthetic data that goes against these heuristics yields very poor performance (BERT was the best at 19% accuracy). A hypothesis-only probe is sketched below.
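A minimal sketch of such a hypothesis-only probe with scikit-learn; the toy data stands in for real SNLI hypotheses and labels:

```python
# Train a classifier that never sees the premise. If it beats the majority
# baseline, the labels are partly predictable from annotation artifacts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "A man is sleeping.",              # 'sleeping' skews toward contradiction
    "A man is performing music.",
    "The man is a famous musician.",
    "A woman is sleeping.",
    "A woman is outdoors.",
    "The woman owns the park.",
]
labels = ["contradiction", "entailment", "neutral",
          "contradiction", "entailment", "neutral"]

probe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
probe.fit(hypotheses, labels)
print(probe.predict(["The dog is sleeping."]))  # artifact pulls toward contradiction
```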
XNLI:
- 15 languages
- translated from SNLI/MNLI
- poor-quality translations; lots of things just don't translate well
Our Chinese NLI
- Undergrads instead of Turkers.
- Told to write 3 neutral, 3 contradiction, and 3 entailment hypotheses as a way of getting them to introduce more variety.
- Students still apply heuristics.
- Issues that emerged:
- phone call transcriptions are bad
- use of questions in premises was confusing
TODO
- How to get more variation in hypotheses?
- One annotator only writes entailments, not contradictions/neutrals.
Zeeshan 02-19-2020
Internship at Amazon and forthcoming thesis
What is transfer learning?
- Transfer learning covers a variety of different things; for a taxonomy, read Ruder 2019.
- Pretraining of word embeddings is probably the most famous form of transfer learning (sketched below).
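A minimal PyTorch sketch of that setup, with random vectors standing in for real pretrained (e.g. GloVe/word2vec) weights:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
pretrained = torch.randn(vocab_size, dim)  # stand-in for loaded vectors

# Initialize the embedding layer from pretrained weights; freeze=True
# keeps the transferred parameters fixed during downstream training.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

token_ids = torch.tensor([[1, 42, 7]])
print(embedding(token_ids).shape)  # torch.Size([1, 3, 50])
```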
Multi-task learning
Hard vs. soft parameter sharing: hard parameter sharing literally shares some of the initial layers and then has task-specific layers toward the end.
Soft parameter sharing uses some method of regularization to force the corresponding layers of the two tasks to be close to each other.
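A minimal PyTorch sketch of hard parameter sharing (layer sizes and task heads are invented for illustration):

```python
import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    def __init__(self, in_dim=50, hidden=64, n_tags=10, n_classes=3):
        super().__init__()
        # Shared initial layers: literally the same parameters for both tasks.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific layers toward the end.
        self.head_a = nn.Linear(hidden, n_tags)     # e.g. a tagging task
        self.head_b = nn.Linear(hidden, n_classes)  # e.g. a classification task

    def forward(self, x, task):
        h = self.shared(x)
        return self.head_a(h) if task == "a" else self.head_b(h)

model = HardSharingModel()
x = torch.randn(8, 50)
print(model(x, "a").shape, model(x, "b").shape)
```

Soft sharing would instead keep two separate towers and add a regularization term to the loss (e.g. the squared distance between corresponding weight matrices) to keep them close.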