Natural Language Processing in NJ

Thursday, October 25, 2012

Shay Cohen: Consistent and Efficient Algorithms for Latent-Variable PCFGs

Shay Cohen will present a talk at Rutgers titled "Consistent and Efficient Algorithms for Latent-Variable PCFGs"

Date: Friday, November 2
Time: 11:00am
Location: Rutgers School of Communication and Information, 4 Huntington St., New Brunswick, NJ (Faculty Lounge, Room 323) (map)

ABSTRACT:

In the past few years, there has been increased interest in the machine learning community in spectral algorithms for estimating models with latent variables. Examples include algorithms for estimating mixtures of Gaussians or for estimating the parameters of a hidden Markov model.

Until the introduction of spectral algorithms, the EM algorithm was the mainstay for estimation with latent variables. Still, with EM there is no guarantee of convergence to the global maximum of the likelihood function, and therefore EM generally does not provide consistent estimates for the model parameters. Spectral algorithms, on the other hand, are often shown to be consistent.

In this talk, I will present a spectral algorithm for latent-variable PCFGs, a model widely used for parsing in the NLP community. This model augments the nonterminals in a PCFG with latent states. These latent states refine the nonterminal category in order to capture subtle syntactic nuances in the data. This model has been successfully implemented in state-of-the-art parsers such as the Berkeley parser.

The algorithm developed is considerably faster than EM, as it makes only one pass over the data. Statistics are collected from the data in this pass, and singular value decomposition is performed on matrices containing these statistics. Our algorithm is also provably consistent in the sense that, given enough samples, it will estimate probabilities for test trees close to their true probabilities under the latent-variable PCFG model.
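As a rough illustration of the "collect statistics in one pass, then decompose" idea, and not the algorithm from the talk, the Python sketch below counts co-occurrences of paired events in a single pass and then extracts the leading singular triple of the resulting matrix. The function names, the toy events, and the use of power iteration in place of a full singular value decomposition are all assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_matrix(pairs, rows, cols):
    """Single pass over (row_event, col_event) pairs -> relative-frequency matrix."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return [[counts[(r, c)] / total for c in cols] for r in rows]


def top_singular_triple(matrix, iterations=200):
    """Leading singular value and vectors via power iteration on A^T A.

    Hypothetical helper: a stand-in for a full SVD, adequate for
    nonnegative count matrices with the uniform starting vector used here.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


# Toy data (invented): each pair is one observed co-occurrence of two events.
pairs = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
matrix = cooccurrence_matrix(pairs, rows=["a", "b"], cols=["x", "y"])
sigma, u, v = top_singular_triple(matrix)
```

In this toy setting the data is read once to build the matrix, and all later work happens on the matrix itself, which is the property that makes a one-pass estimator cheaper than an iterative procedure that revisits the data on every round.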

If time permits, I will also present a method aimed at improving the efficiency of parsing with latent-variable PCFGs. This method relies on tensor decomposition of the latent-variable PCFG. This tensor decomposition is approximate, and therefore the new parser is an approximate parser as well. Still, the quality of approximation can be theoretically guaranteed by inspecting how errors from the approximation propagate in the parse trees.

BIO:

Shay Cohen is a postdoctoral research scientist in the Department of Computer Science at Columbia University. He is currently a Computing Innovation Fellow (NSF/CRA). He received his B.Sc. and M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from Carnegie Mellon University in 2011. His research interests span a range of topics in natural language processing and machine learning. He is especially interested in developing algorithms and methods for the use of probabilistic grammars.

Monday, September 24, 2012

Amanda Stent: InteractiveX: Generating multimedia summaries of spatio-temporal data sets

Amanda Stent will give a talk entitled "InteractiveX: Generating multimedia summaries of spatio-temporal data sets"

Date: October 1, 2012

Time: 2:30pm

Location: ETS, Anrig Hall, Room P-016 (directions | campus map)

VISITORS TO ETS: Please contact Joel Tetreault (jtetreault at ETS dot org) for security and arrival information.

ABSTRACT:

Organizations and individuals increasingly have to deal with large to very large data sets that include spatio-temporal information, such as network traffic data, credit card records, wildlife tracking data, exam scores by time and place, and even data from social media such as Twitter. We have access to sophisticated statistical and visualization tools for analyzing these data sets. However, the output from these tools is frequently only understandable by experts -- and even experts can start to suffer from information overload. We have designed a system, InteractiveX, that guides users to understand the meaning of large spatio-temporal data sets through automatic creation of interactive multimedia explanations that combine text, graphics and the results of data analysis. In this talk, we will first outline the challenges of this task. We will then present the architecture of our system, demonstrate some user interfaces to our system, and describe some recent research results from this work.

BIO:

Dr. Amanda Stent works on spoken dialog, natural language generation and assistive technology. She is currently a Principal Member of Technical Staff at AT&T Labs - Research in Florham Park, NJ and was previously an associate professor in the Computer Science Department at Stony Brook University in Stony Brook, NY. She holds a PhD in computer science from the University of Rochester. She has authored over 70 papers on natural language processing and holds several patents. She is VP of the ACL/ISCA Special Interest Group on Discourse and Dialog and one of the rotating editors of the journal Dialogue and Discourse.

Friday, September 21, 2012

Biemann Slides and Site Features

The slides from the Chris Biemann talk at ETS on September 20, titled "Text: Now in 2D -- Lexical Expansion Using Contextual Similarity", are now available to view and download.

We will host an archive of materials from past talks here. They will be found to the right under the heading "Links to Materials". Other recent additions to the site include a calendar and a list of upcoming events. The calendar can be found at the bottom of the page, and will be updated with any events posted here. For a quick view of upcoming events, check under the heading "Upcoming NLP Events" at the top of the column to the right. Any suggestions for further improvements and/or announcements are welcome!

Thursday, September 6, 2012

Chris Biemann: Text: Now in 2D — Lexical Expansion using Contextual Similarity

Chris Biemann will present a talk titled “Text: Now in 2D — Lexical Expansion using Contextual Similarity”.

Date: September 20, 2012
Time: 11:00am
Location: ETS, Conant Hall, Lounge A (directions | campus map).

ABSTRACT:

This talk introduces the metaphor of two-dimensional text. Starting from very basic concepts of structural linguistics, we define lexical expansion mechanisms that generate, for each term in context, a weighted list of possible expansions. While the mechanism is left unspecified by the metaphor, we use distributional similarity as a source for all-words unsupervised lexical expansion. Handling word sense ambiguity in the expansion mechanism will be discussed from two angles: Either a contextualized method can rank similar terms of the correct sense higher, or we can use a word sense induction clustering in order to aggregate over common features of the potential expansions.
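To make the idea of distributional similarity as an expansion source concrete, here is a minimal Python sketch, assuming a tiny invented corpus, a one-word context window, and cosine similarity over raw co-occurrence counts. None of these choices come from the talk, and the function names are hypothetical; the sketch also ignores the word sense handling described above.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand(term, vectors, top_n=3):
    """Weighted list of expansions: other words ranked by context similarity."""
    target = vectors.get(term, Counter())
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    scored = [(word, score) for word, score in scored if score > 0.0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy corpus (invented for illustration).
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "car", "drives"],
]
expansions = expand("cat", context_vectors(corpus))
```

Here "dog" is the only expansion returned for "cat" because the two words share all of their contexts, while "car" shares none. A realistic system would use a far larger corpus and richer context features than adjacent words.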

This new representation has been successfully used in tasks like semantic text similarity and knowledge-based all-words word sense disambiguation. The key element of this representation and the method that computes it is that it can bridge lexical gaps and align passages that bear the same meaning without using the same words. Thus, it can be used as a basis technology for passage and answer scoring, and essay grading.

BIO:

Chris Biemann holds an MA and a doctorate from the University of Leipzig, Germany. After his postdoc at the semantic search start-up Powerset and subsequently the Microsoft Bing search engine, Chris became an assistant professor for language technology at the Technische Universität Darmstadt last year. His main research interests span unsupervised, knowledge-free acquisition, graph-based representations and algorithms, crowdsourcing, and big data for NLP applications. Currently, Chris is a visiting researcher at the IBM Watson Research Lab in Hawthorne, NY, working with the Watson DeepQA team.

David Kaufer: Tools for Building Social Communities Around Texts and Text Analysis

David Kaufer will present a talk titled "Tools for Building Social Communities Around Texts and Text Analysis"

Date: September 13, 2012
Time: 1:30pm
Location: ETS, Conant Hall, Lounge A (directions | campus map).

ABSTRACT:

This talk will present two technologies used at Carnegie Mellon in the English department for creating social communities around texts (Classroom Salon) and text analysis (DocuScope). Classroom Salon (www.classroomsalon.org) is a web-based tool used to support writing and content classrooms. Classroom Salon supports both anchored and global annotation of text and, in humanities classrooms, allows teachers and students to create classroom discussions around text before the class meets. This allows teachers to monitor each student's participation and depth of reading. In science classrooms, Classroom Salon has been used to assess students' comprehension of difficult concepts. It is being tested at the University of Wisconsin -- Milwaukee with a Gates Foundation Grant. Preliminary results show science teachers like it because it helps them gauge students' understanding of the material and adapt it accordingly.

DocuScope (http://www.cmu.edu/hss/english/research/docuscope.html) is a stand-alone Java application that consists of large string-based dictionaries of English rhetorical patterns developed over a decade of inspection of texts. These patterns have been shown to accurately classify major genres of written English. They have been used to understand precisely how one genre of English differs from another (letters vs. reminiscences) or the variation that takes place within a single genre (the different rhetorical strategies that can define a reminiscence). In this talk, I'll review the main breakdown of the dictionaries. The dictionaries were first developed to support a course in comparative genres of English, and I will also discuss some educational applications.

BIO:

David Kaufer is Professor of English at Carnegie Mellon. From 1994 to 2009, he was the Head of the Department. He serves on the Executive Board of the Rhetoric Society of America. He is the lead author of five books and co-author of two more. He is the author of over 100 refereed articles in the fields of text-based rhetorical analysis, rhetorical theory, and written composition.

Welcome!

Welcome to the NJ-NLP blog! This site will be used as a hub for sharing information about Natural Language Processing (NLP) events occurring in New Jersey. We will announce events and talks related to NLP, including location, time, topic/title, abstract and any other pertinent information.

The goal of this site is two-fold. First, it will serve as a sort of calendar of NLP events for New Jersey. Second, it will provide an overview of the types of work being done at the various institutions where NLP research is undertaken throughout the state. The blog is an open-ended idea, at this time, so it may evolve to include a broader range of posts. Anything that encourages a community of collaboration and the exchange of ideas is welcome.