Friday, 18th October - 11am Henry Thompson : Seminar

Title: Common Crawl and ChatGPT: From obscurity to centre stage

Abstract: Common Crawl is a multi-petabyte longitudinal dataset containing over 100 billion web pages which is widely used as a source of language data for sequence model training and in Web science research. In this talk I'll give a short introduction to Common Crawl (CC): how its constituent archives are created, their structure and what kind of tooling you need to work with them. We have local copies of a dozen CC archives attached to Cirrus, one of EPCC's multiprocessor compute resources. I'll briefly describe some recent work I've done using these archives, including a surprising insight they provide into how the Web has been changing from human-authored to automatically constructed.

CC was the single largest data source used to train GPT-3, the base LLM for the first ChatGPT releases. CC will continue to provide new monthly archives, which are very likely to provide the bulk of the data used to train new versions of the big commercial LLMs. There is a very real risk that as the role of ChatGPT and its competitors in producing text for the Web grows, the proportion of information to be found on the Web, and thus in new archives, which is both new and true, will decrease, drowned by LLM-generated text. Unfortunately, to date none of the big players have shown any (public) interest in watermarking or otherwise identifying their output.

So the talk will finish with a sketch of some ideas about how to use CC archives as a basis for a classifier for Web pages as first-generation human or not. Then I plan to open the floor to discussion about what a research proposal might look like exploring detection of LLM output on the Web.

Short bio: Henry S. Thompson is Professor of Web Informatics in the University of Edinburgh School of Informatics.
His research interests have ranged widely, originally in the areas of speech and language processing, more recently focussing on understanding and articulating the architectures of the Web. He retains an interest in the fundamental goals of Cognitive Science as originally conceived, believing that both sub-symbolic and symbolic mechanisms have a part to play in an adequate account of human cognition. He is committed to the proposition that our professional responsibilities and the success of our intellectual pursuits both depend on keeping our humanity clearly in view. He was a member of the SGML Working Group of the World Wide Web Consortium (W3C), which designed XML, a major contributor to the core concepts of XSLT and W3C XML Schema, and a member of the XML Core and XML Processing Model Working Groups at the W3C. He was elected five times to the W3C TAG (Technical Architecture Group), serving from 2005 to 2014, and continues to contribute to Web standards through the W3C and IETF.

Oct 18 2024 11.00 - 12.00

This event is co-organised by ILCC and by the UKRI Centre for Doctoral Training in Natural Language Processing, https://nlp-cdt.ac.uk.

IF G.03 and on Teams