Work at CCU centers on applications of artificial intelligence to a wide variety of content, with a historical focus on applied and basic research on text data. Below we list a few sample projects from our group.
Transfer learning for entity recognition
In this project from 2018, we replicated and extended several past studies on transfer learning for entity recognition. In particular, we were interested in entity recognition problems where the class labels in the source and target domains are different. Our work was the first direct comparison of several previously published approaches in this problem setting. In addition, we performed experiments on seven new source/target corpus pairs, nearly doubling the total number of corpus pairs that had been studied in all past work combined. A paper based on this work received an Area Chair Favorites award at COLING 2018, a major NLP conference.
New techniques for topic modeling
In this joint project with the ECE department at UT Austin, we created a new method for network and topic modeling based on Poisson factorization. This collaboration produced a model called Joint Gamma Process Poisson Factorization (J-GPPF). Our model can extract latent communities and topics simultaneously from a dataset, as demonstrated in the following example, where we modeled 1,600 Wikipedia articles tagged with the “acoustic” and “machine learning” categories. J-GPPF identified the following overall top words in the articles:
|Topic Number|Top 10 Words|
|1|learning, data, algorithm, machine, model, method, used, set, problem, function|
|2|album, music, song, released, one, band, country, guitar, award, record|
|3|sound, frequency, wave, acoustic, one, time, used, source, also, tone|
We see that the model identified a machine learning topic, and correctly separated the topics related to acoustics into a musical topic and a scientific topic. The model simultaneously identified groups of Wikipedia articles associated with categories such as numerical software, singer-songwriters, acoustic guitars, and machine learning researchers. In the following graph we show the top words for the top two groups:
Our code for the J-GPPF model is released as a Scala package, linked on our GitHub page above. We also include simplified Poisson factorization models for network-only or text-only corpora. The GitHub repository contains our papers, installation instructions, and a tutorial on joint modeling that uses bag-of-words bill summaries and a voting network derived from U.S. Senate voting records.
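To give a rough feel for the text-only case, the Python sketch below fits a plain Poisson factorization to a tiny bag-of-words matrix and lists the top words per topic. It is not J-GPPF and does not reflect the API of our Scala package: it omits the gamma process priors and the network component, and it uses maximum-likelihood multiplicative updates (equivalent to non-negative matrix factorization under the generalized KL divergence) rather than the inference used in our papers. The function names, the toy vocabulary, and the counts are all assumptions made up for illustration.

```python
import math
import random


def reconstruct(theta, beta):
    """Return the Poisson rate matrix theta @ beta as nested lists."""
    n_topics, n_words = len(beta), len(beta[0])
    return [[sum(row[k] * beta[k][v] for k in range(n_topics))
             for v in range(n_words)] for row in theta]


def poisson_log_likelihood(counts, theta, beta, eps=1e-12):
    """Poisson log-likelihood of the counts, dropping the constant -log(y!) term."""
    rates = reconstruct(theta, beta)
    return sum(y * math.log(r + eps) - r
               for y_row, r_row in zip(counts, rates)
               for y, r in zip(y_row, r_row))


def fit_poisson_factorization(counts, n_topics, n_iters=200, seed=0, eps=1e-12):
    """Fit counts[d][v] ~ Poisson(sum_k theta[d][k] * beta[k][v]).

    Uses multiplicative updates for the generalized KL divergence, which
    maximize the Poisson likelihood. Hypothetical helper, not the J-GPPF sampler.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    theta = [[rng.uniform(0.5, 1.5) for _ in range(n_topics)] for _ in range(n_docs)]
    beta = [[rng.uniform(0.5, 1.5) for _ in range(n_words)] for _ in range(n_topics)]

    def ratios():
        # Element-wise ratio of observed counts to current Poisson rates.
        rates = reconstruct(theta, beta)
        return [[counts[d][v] / (rates[d][v] + eps) for v in range(n_words)]
                for d in range(n_docs)]

    for _ in range(n_iters):
        # Update document-topic weights with topic-word weights held fixed.
        ratio = ratios()
        beta_sums = [sum(beta[k]) + eps for k in range(n_topics)]
        theta = [[theta[d][k]
                  * sum(beta[k][v] * ratio[d][v] for v in range(n_words))
                  / beta_sums[k]
                  for k in range(n_topics)] for d in range(n_docs)]

        # Update topic-word weights with document-topic weights held fixed.
        ratio = ratios()
        theta_sums = [sum(theta[d][k] for d in range(n_docs)) + eps
                      for k in range(n_topics)]
        beta = [[beta[k][v]
                 * sum(theta[d][k] * ratio[d][v] for d in range(n_docs))
                 / theta_sums[k]
                 for v in range(n_words)] for k in range(n_topics)]
    return theta, beta


def top_words(beta, vocab, n=3):
    """Return the n highest-weight vocabulary words for each topic."""
    return [[vocab[v] for v in
             sorted(range(len(vocab)), key=lambda v: -beta[k][v])[:n]]
            for k in range(len(beta))]


# Toy data, invented for this example: rows are documents, columns are word counts.
vocab = ["learning", "data", "model", "music", "song", "band"]
counts = [
    [5, 4, 3, 0, 0, 0],
    [4, 6, 2, 0, 1, 0],
    [0, 0, 1, 6, 5, 4],
    [0, 1, 0, 4, 5, 6],
]

theta, beta = fit_poisson_factorization(counts, n_topics=2)
for k, words in enumerate(top_words(beta, vocab), start=1):
    print(f"topic {k}: {', '.join(words)}")
```

Each row of `beta` plays the role of one row of the table above: sorting a topic's weights over the vocabulary yields its top words. Plain Python lists keep the sketch dependency-free; a real corpus would call for sparse matrices and a proper inference routine.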