Compounds

Sara Stymne

August 31, 2016

What is compounding?

A type of word formation (cf. derivation, ...)
"The process of forming a word by combining two or more existing words" (Trask, 1993)
"a compound is a lexeme (less precisely, a word) that consists of more than one stem" (Wikipedia, 2016)

Examples: adventure story, competition-friendly, sailboat

Open vs closed compounds

Open: written as separate words
  apple tree, taxi driver
  Common in English

Closed: written as single words
  strawberry, äppelträd, Hochhaus
  Germanic languages, Finnish, ...; exists in English

Hyphenated:
  car-wash, EU-länder, G8-Staaten

Compound examples

Different parts-of-speech
Different constructions for English translations

Compound               Gloss                 Translation
informationssamhälle   information society   information society
ändringsförslag        change suggestion     amendment
yttrandefrihet         statement freedom     freedom of speech
proteinrik             protein rich          protein-rich
framförallt            before everything     mainly
klargöra               clear make            clarify

Compounding forms/filler letters

There are often changes to the form of compound modifiers:

Change   Example            Translation
0        Kind+phase         child-rearing period
+s       Kinds+Lage         fetal position
+es      Kindes+Schutz      child protection
+er      Kinder+Film        children's film
–        Ein-Kind-Politik   one-child-policy
-a       gat+sten           paving stone
-a/+u    gatu+konst         street art
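The change notation above can be read as a small set of string operations on the modifier's base form. The following is a minimal sketch under that reading; the function name `modifier_form` is hypothetical, and the base form "gata" (Swedish for "street") is an assumption not stated on the slide.

```python
def modifier_form(lemma: str, change: str) -> str:
    """Apply a change such as '0', '+s', '-a' or '-a/+u' to a base form.

    '+x' appends the letters x, '-x' removes a final x, '0' leaves the
    form unchanged. Several operations are separated by '/'.
    """
    if change == "0":
        return lemma
    form = lemma
    for op in change.split("/"):
        sign, letters = op[0], op[1:]
        if sign == "+":
            form += letters
        elif sign == "-":
            if not form.endswith(letters):
                raise ValueError(f"{form!r} does not end in {letters!r}")
            form = form[: len(form) - len(letters)]
        else:
            raise ValueError(f"unknown operation {op!r}")
    return form


# Building compounds from the table above (base forms are assumptions).
print(modifier_form("Kind", "+es") + "schutz")   # Kindesschutz
print(modifier_form("gata", "-a/+u") + "konst")  # gatukonst
```

A compound splitter would need the reverse direction: given a surface modifier such as "gatu", recover candidate base forms by undoing each known change.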

Transparency

Compounds can be more or less transparent:
  apple tree, flea market, strawberry, paperback, honeymoon

Compound semantics

The meaning of compound parts can be related in different ways
Many classifications exist

Ó Séaghdha and Copestake (2009):
  BE           plastic box
  HAVE         polio sufferer
  IN           air disaster
  ABOUT        tax law
  ACTOR        taxi driver
  INSTRUMENT   rice cooker

Finding paraphrases for compounds
  "A box made of plastic", "A tool for cooking rice"

NLP challenges

Compounds are common in many languages
Compounds are productive in many languages
Many compounds are rare
Closed compounds need to be analysed into their parts
The relationship between compound parts is not always easy to guess

Compound related NLP challenges

Compound splitting
Compound merging
  e.g. in MT or speech recognition
  grammar checking: erroneously split compounds

Compound interpretation / paraphrasing
Compound bracketing
  school book shelf
  (school book) shelf OR school (book shelf)
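One common way to approach the bracketing of a three-part compound is to compare how strongly the first two words are associated with how strongly the last two are. The sketch below uses raw bigram counts as the association measure. The counts are invented for illustration, the function name `bracket` is hypothetical, and resolving ties in favour of left bracketing is an arbitrary choice.

```python
from collections import Counter

# Invented bigram counts; a real system would collect these from a corpus.
BIGRAM_COUNTS = Counter({
    ("school", "book"): 40,
    ("book", "shelf"): 120,
})


def bracket(w1, w2, w3, counts=BIGRAM_COUNTS):
    """Choose between ((w1 w2) w3) and (w1 (w2 w3)) by comparing counts."""
    left = counts[(w1, w2)]    # evidence for (w1 w2) w3
    right = counts[(w2, w3)]   # evidence for w1 (w2 w3)
    if right > left:
        return (w1, (w2, w3))
    return ((w1, w2), w3)      # ties fall back to left bracketing


print(bracket("school", "book", "shelf"))
```

With these toy counts the sketch picks school (book shelf). Different counts, or a different association measure, could just as well give (school book) shelf.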

NLP tasks where compound processing is important

Machine translation
Information retrieval
Recognizing textual entailment
Parsing
Speech recognition
...

Open and closed compounds for NLP

The challenges are different for languages with closed and open compounds

Closed compounds
  Sparsity, many rare word forms
  Compound splitting is often useful
  ...

Open compounds
  Treating multi-words as a unit
  Semantic interpretation
  ...

Example: SMT from a compounding language

Sparse data problems: many compounds are OOVs
Compounds are often mis-aligned, leading to bad translations
Solution: Split compounds before training/translating
Much research, e.g. Nießen and Ney, 2000; Koehn and Knight, 2003; Popović et al., 2006; Stymne 2008, ...
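As an illustration of what splitting before training can look like, here is a minimal sketch loosely modelled on frequency-based splitting: every segmentation into known words is scored by the geometric mean of the part frequencies, and the best one is kept. The frequency table, the filler list and the minimum part length are all invented assumptions, and none of the names below come from a specific toolkit.

```python
from math import prod

# Invented corpus frequencies (lower-cased); assumptions for illustration.
FREQ = {"kind": 500, "schutz": 300, "kindesschutz": 2,
        "hoch": 800, "haus": 900, "hochhaus": 50}
FILLERS = ("", "s", "es")   # candidate filler letters after a modifier
MIN_PART = 3                # shortest part we allow


def candidate_splits(word):
    """Yield every segmentation of `word` into parts found in FREQ."""
    if word in FREQ:
        yield (word,)
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        prefix = word[:i]
        for filler in FILLERS:
            if filler and not prefix.endswith(filler):
                continue
            modifier = prefix[: len(prefix) - len(filler)]
            if len(modifier) < MIN_PART or modifier not in FREQ:
                continue
            for rest in candidate_splits(word[i:]):
                yield (modifier,) + rest


def score(parts):
    """Geometric mean of the part frequencies."""
    return prod(FREQ[p] for p in parts) ** (1 / len(parts))


def best_split(word):
    """Return the highest-scoring segmentation, or the word itself."""
    word = word.lower()
    candidates = list(candidate_splits(word))
    if not candidates:
        return (word,)
    return max(candidates, key=score)


print(best_split("Kindesschutz"))
```

Because the unsplit word is itself a candidate, a frequent compound can win over its parts and stay whole. With the toy numbers above, however, both "Kindesschutz" and "Hochhaus" are split.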

Example: SMT into a compounding language

Sparse data problems
Fewer compounds in SMT output than in human translations
Compounds can be realised as:
  separate words
  other constructions
  only one part of the compound translated

My work on this issue:
  Split compounds
  Improve the order of compounds by using POS sequence models
  heuristics and ML for compound merging
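To make the merging step concrete, here is a minimal sketch of one simple heuristic: compound modifiers carry a marker when they are split, and any marked token in the output is joined to the token that follows it. The marker `#` and the function name are assumptions made for this example, not necessarily the scheme used in the work cited above, and the sketch ignores casing and compounding forms.

```python
MARK = "#"  # hypothetical marker appended to split-off modifiers


def merge_compounds(tokens):
    """Join every marked token with the token that follows it."""
    merged = []
    pending = ""  # modifiers collected so far for the current compound
    for tok in tokens:
        if tok.endswith(MARK) and len(tok) > len(MARK):
            pending += tok[: -len(MARK)]
        else:
            merged.append(pending + tok)
            pending = ""
    if pending:  # marked token at the end of the sentence: keep it as a word
        merged.append(pending)
    return merged


print(merge_compounds(["ett", "informations#", "samhälle"]))
```

A purely symbol-based rule like this merges whatever the decoder happens to place after a marked token, which is one reason to consider POS-based checks or a learned merging decision as well.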

Summary

Compounds are interesting
There are many different aspects and tasks that you can focus on

Compounds are challenging for many NLP tasks
Compounds are fun!

Some possible project sketches

Improving methods for compound splitting, possibly with automatic identification of compounding forms
Applying and comparing compound processing for a "new" language
Finding paraphrases for compounds (open or closed)
Compound splitting and/or merging for SMT
Removing modifiers and keeping only compound heads for parsing
A contrastive study of compound use in two/several languages