Friday, 29th November, 11am - Seminar: Pedro Ortiz Suarez

Title: Harnessing Common Crawl for NLP and ML applications

 

Abstract:

This presentation looks at effective strategies for using Common Crawl’s web archive in large-scale research, specifically for NLP and other ML applications. We will discuss practical approaches to processing and filtering Common Crawl’s datasets, with a focus on how to overcome computational challenges and optimise data pipelines. We will also discuss some of the challenges that users might encounter related to the multilingual and heterogeneous nature of Common Crawl’s data. The talk will cover best practices for data filtering, pre-processing, and storage, to ensure the quality and relevance of extracted information for research tasks. Additionally, we will briefly discuss the ranking mechanism used to determine whether a URL is crawled, and demonstrate how to use the Web Graph as a framework for further research.

Bio: 

Pedro is a mathematician, computer scientist and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro’s research has mainly focused on how data quality impacts ML models’ performance and how to improve these models through data-driven approaches. Pedro has been a main contributor to multiple open-source Large Language Model initiatives such as CamemBERT, BLOOM and OpenGPT-X. Prior to joining the Common Crawl Foundation, Pedro was the founder of the open-source project OSCAR, which provides high-performance data pipelines to annotate Common Crawl’s data and make it more accessible to NLP and LLM researchers and practitioners. Pedro has also contributed to many projects in information extraction and other NLP applications for both the scientific domain and Digital Humanities.