Friday, 3rd March 2023 - 11am Dan Vilenchik : Seminar | ILCC

Title: Two new approaches to textual data augmentation

Abstract:

Data augmentation is a widely studied topic in visual tasks (e.g. image classification), but far less so for textual tasks. We present two recent papers (currently under review) which offer two novel approaches. The first paper deals with the task of modeling human personality. The field of computational personality analysis relies heavily on labeled data, which may be expensive, difficult, or impossible to get. This problem is amplified when dealing with rare personality types or disorders (e.g., the anti-social psychopathic personality disorder). In this context, we developed a text-based data augmentation approach for human personality (PEDANT). PEDANT doesn't rely on the common type of labeled data but on the generative pre-trained model (GPT) combined with domain expertise. Testing the methodology on three different datasets provides results that support the quality of the generated data. The second paper deals with the task of hate speech detection, which hinges upon the availability of rich and variable labeled data, which is hard to obtain. In this work, we present a new approach for data augmentation that uses as input real unlabeled data which is carefully selected from online platforms where invited hate speech is abundant. We show that by harvesting and processing this data (in an automatic manner) one can augment existing manually labeled datasets to improve the classification performance of hate speech classification models. We observed an improvement in F1-score ranging from 2.7% to 9.5%, depending on the task (in-domain or cross-domain) and on the model that was used.

The talk is based on two papers:

“Data Augmentation for Modeling Human Personality: The Dexter Machine”, joint work with Yair Neuman and Vlad Kozhukhov (BGU). Paper under review.

“HARALD: Augmenting Hate Speech Data Sets with Real Data”, joint work with Tal Ilan (Master's thesis). The paper was accepted to Findings of EMNLP 2022.

Short bio:

Dan Vilenchik holds a PhD in computer science from Tel Aviv University. He did a postdoc at UC Berkeley and UCLA. He is currently a tenured member of the School of Electrical Engineering at Ben-Gurion University. His research includes both theoretical (challenges of high-dimensional data) and applied (NLP and multidisciplinary projects) aspects of machine learning.

https://www.bgu.ac.il/~vilenchi