Title: Unstructured Data Management at Scale for Large Language Models
Speaker: Dong Deng (邓栋), Assistant Professor, Computer Science Department, Rutgers University (USA)
Host: Prof. Zhang Wei (张伟)
Time: Thursday, March 23, 2023, 18:00-19:00
Venue: Room B112, Science Building, Putuo Campus
Abstract:
A clear trend in machine learning is that models are becoming larger and larger and are trained on more and more data. For example, both the number of parameters and the training corpus size of large language models (LLMs) have grown by roughly 1000x over the past few years. As a result, the latest LLMs are trained on TB-scale data, which poses significant data management challenges: even a simple operation on the training data entails an enormous amount of computation. Recent studies find that LLMs memorize part of their training data, which brings significant privacy risks. In this talk, we discuss how to evaluate LLM memorization behavior quantitatively. For this purpose, we develop an efficient and scalable near-duplicate sequence search algorithm. Given a query sequence, it finds (almost) all of its near-duplicate sequences in a TB-scale training corpus. Note that a sequence is a snippet of a text, so the number of sequences in a text is quadratic in the text length. In addition, we briefly discuss how to remedy LLM memorization through efficient training data deduplication.
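To make the problem concrete, below is a minimal brute-force Python sketch of near-duplicate sequence search under edit distance. This is only an illustration of the problem definition, not the scalable algorithm presented in the talk; the helper names (edit_distance, near_duplicate_snippets) and the distance threshold tau are assumptions introduced here. The brute force enumerates all O(n^2) snippets of a text of length n, which is precisely why an efficient index is needed at TB scale.

```python
# Illustrative sketch only: the talk's actual algorithm is an efficient,
# scalable index-based method; this brute force just defines the problem.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def near_duplicate_snippets(text: str, query: str, tau: int):
    """Enumerate every snippet (contiguous substring) of `text` whose
    length is within `tau` of the query length, and keep those within
    edit distance `tau` of the query. A text of length n has O(n^2)
    snippets, so this scan is infeasible on a TB-scale corpus."""
    q = len(query)
    hits = []
    for start in range(len(text)):
        for length in range(max(1, q - tau), q + tau + 1):
            snippet = text[start:start + length]
            if len(snippet) < length:
                break  # ran off the end of the text
            if edit_distance(snippet, query) <= tau:
                hits.append((start, snippet))
    return hits

if __name__ == "__main__":
    doc = "large language models memorize parts of their training data"
    # Finds e.g. "language models" within edit distance 2 of the query.
    print(near_duplicate_snippets(doc, "languag models", tau=2))
```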
Speaker Bio:
Dong Deng (邓栋) is an assistant professor in the Computer Science Department at Rutgers University. His research interests include large-scale data management, data science, database systems, and data curation. Before joining Rutgers, he was a postdoc in the Database Group at MIT, where he worked with Mike Stonebraker and Sam Madden on data curation systems. He received his Ph.D. with honors from Tsinghua University. He has published over 30 research papers at top database venues, mainly SIGMOD, VLDB, and ICDE. According to Google Scholar, his publications have received over 2000 citations.