Josh Waxman

Ph.D. Student

Computer Science Program
CUNY Graduate Center
The City University of New York
Ph.D. Program Homepage:

Contact Info

Email Address:
joshwaxman (at) gmail (dot) com

Postal Address:
The LATLab, NSB A-207-A
Computer Science Department
Queens College/CUNY
65-30 Kissena Blvd
Flushing, NY 11367

Research Interests

Computational linguistic techniques for low-density languages, and techniques for bridging natural language processing tools and resources from high-density languages to related low-density languages.

Current Research Abstract

Low-density NLP is currently haphazard rather than systematic. It is often approached as an engineering task: different architectures are considered for different languages and available resources, and each resulting architecture is then tested on the unique corpus and language for which it was designed. This makes it difficult to evaluate which approach is truly the “best” overall, and which approach is best for a given language.

I propose to re-implement the state-of-the-art architectures and approaches to low-density language part-of-speech tagging, all of which exploit a relationship between a high-density (HD) language and a low-density (LD) language. As a novel contribution, I will test these on a representative sample of twenty (HD – LD) language pairs, all drawn from the same massively parallel corpus, Europarl. With this testbed in place, I will be able to perform comparisons that were not previously possible: to evaluate which broad approach performs best for particular language pairs, and to investigate whether particular language features should suggest a particular NLP approach.

My survey of the existing approaches has suggested some novel approaches that have not yet been explored. I believe that these could either yield better performance or be quicker to implement than some of the existing approaches.

I propose to implement two innovative approaches. The first is a language-ifier, which modifies an LD corpus to be more like an HD corpus, or alternatively modifies an HD corpus to be more like an LD corpus, prior to supervised training. I intend to implement four deliberately restrictive language-ifiers: lexical replacement, affix replacement, cognate replacement, and exemplar replacement. The second is an iterative approach, in which the model produced by one round of supervised training is applied to a corpus, the sentences tagged with high confidence are selected and added to the training data, and a new round of training occurs, until convergence.
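The iterative approach can be illustrated with a minimal Python sketch. This is not the proposed implementation: the toy tagger (most frequent tag per word, with a two-character suffix fallback for unknown words), the confidence measure (the lowest per-word relative tag frequency in the sentence), the 0.9 threshold, and all function names are assumptions made only for illustration. Convergence is taken here to mean that a round selects no new high-confidence sentences.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Train a toy tagger: tag counts per word and per 2-character suffix."""
    word_counts = defaultdict(Counter)
    suffix_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_counts[word][tag] += 1
            suffix_counts[word[-2:]][tag] += 1
    return word_counts, suffix_counts


def tag_with_confidence(model, words):
    """Tag a sentence and return (tagged_sentence, confidence).

    Confidence is the lowest relative frequency of any chosen tag.
    A word with neither a known form nor a known suffix yields (None, 0.0).
    """
    word_counts, suffix_counts = model
    tagged, confidence = [], 1.0
    for word in words:
        counts = word_counts.get(word) or suffix_counts.get(word[-2:])
        if not counts:
            return None, 0.0
        tag, n = counts.most_common(1)[0]
        tagged.append((word, tag))
        confidence = min(confidence, n / sum(counts.values()))
    return tagged, confidence


def self_train(seed, unlabeled, threshold=0.9, max_rounds=10):
    """Repeat train / tag / select until no new sentences are selected."""
    labeled = list(seed)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, still_unlabeled = [], []
        for words in remaining:
            tagged, conf = tag_with_confidence(model, words)
            if conf >= threshold:
                confident.append(tagged)
            else:
                still_unlabeled.append(words)
        if not confident:
            break  # converged: this round added nothing new
        labeled.extend(confident)
        remaining = still_unlabeled
    return train(labeled), remaining


if __name__ == "__main__":
    seed = [[("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")]]
    unlabeled = [["the", "dog", "walked"], ["a", "cat"]]
    model, leftover = self_train(seed, unlabeled)
    print(tag_with_confidence(model, ["the", "dog", "walked"]))
    print(leftover)
```

In this toy run, the sentence containing "walked" is absorbed into the training data through the suffix fallback, while "a cat" stays unlabeled because nothing in the model covers it. A real tagger would supply its own confidence estimate in place of the relative-frequency heuristic used here.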