Article | hamel.dev | 9 min read

The Resurgence of Data Science in the Age of AI

By Hamel Husain

AI Summary

Once hailed as the 'sexiest job of the 21st century,' the data scientist role has changed significantly. As the tech landscape shifted, the emergence of Machine Learning Engineers (MLEs) seemed to overshadow traditional data science roles. Then large language models (LLMs) and foundation-model APIs made it possible for teams to integrate AI without directly involving data scientists or MLEs, which has led some to question whether data scientists are still relevant to AI development.

Despite these changes, I argue that the core tasks of data scientists remain crucial. While training models might not be the primary focus anymore, setting up experiments, debugging stochastic systems, and designing metrics are still essential. Simply calling an LLM over an API does not eliminate the need for these tasks. In my talk, 'The Revenge of the Data Scientist,' I emphasized that much of the underlying framework for AI systems is rooted in data science.

A key component of this framework is the 'harness,' which includes an observability stack of logs, metrics, and traces. These elements are vital for checking that AI systems operate correctly and efficiently. Unfortunately, many engineers lack a deep understanding of these components, which leads to misconceptions about how relevant retrieval and evaluation are.
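To make the idea of traces concrete, here is a minimal sketch of recording one pipeline step as a structured span. It uses only the Python standard library. The names TraceRecorder, span, and to_jsonl are hypothetical and are not taken from the talk or from any particular observability product.

```python
import json
import time
import uuid
from contextlib import contextmanager


class TraceRecorder:
    """Hypothetical in-memory recorder: one dict per pipeline step."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        # Each span keeps its inputs (attributes), status, and duration.
        record = {
            "span_id": uuid.uuid4().hex,
            "name": name,
            "attributes": attributes,
            "status": "ok",
        }
        start = time.perf_counter()
        try:
            # The caller can attach outputs to the record inside the block.
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)

    def to_jsonl(self):
        # One JSON object per line, easy to load later for analysis.
        return "\n".join(json.dumps(s) for s in self.spans)


recorder = TraceRecorder()
with recorder.span("retrieve", query="refund policy") as step:
    step["num_documents"] = 3  # placeholder for a real retrieval result
```

Keeping each step as a plain record means the traces can later be loaded into a notebook and inspected like any other dataset.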

I identified several pitfalls in AI development that highlight the importance of data science. The first is reliance on generic metrics, which often fail to diagnose application-specific failures. A data scientist would instead dig into the data and develop custom metrics that target specific issues. Another pitfall is the unverified judge: an LLM is used to rate outputs without validating its ratings. A data scientist would treat the judge as a classifier and check its accuracy against human labels.
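Treating a judge as a classifier can be sketched as follows, assuming each output has a binary pass/fail label from a human and one from the LLM judge. The names JudgeReport and validate_judge, and the sample labels, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    true_positive_rate: float  # judge says pass when the human says pass
    true_negative_rate: float  # judge says fail when the human says fail
    agreement: float           # fraction of items where both agree


def validate_judge(human_labels: list[bool], judge_labels: list[bool]) -> JudgeReport:
    """Compare LLM-judge verdicts against human labels (True means pass)."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human pass and one human fail")
    return JudgeReport(
        true_positive_rate=tp / (tp + fn),
        true_negative_rate=tn / (tn + fp),
        agreement=(tp + tn) / len(pairs),
    )


# Illustrative labels only, not real data.
human = [True, True, True, False, False, False, True, False]
judge = [True, True, False, False, True, False, True, False]
report = validate_judge(human, judge)
```

The two rates are reported separately because raw agreement can look high when one class dominates, even if the judge misses most of the rarer class.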

Experimental design is another area where data science expertise is crucial. Many teams rely on synthetic data generated by LLMs, which can lead to unrepresentative test sets. Data scientists would ground synthetic data in real-world examples and design metrics that are actionable and tied to business outcomes. Data labeling matters just as much: data scientists emphasize involving domain experts in the labeling process to ensure quality and relevance.
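One possible way to ground synthetic data in real examples is to seed every generation prompt with queries sampled from real usage. The sketch below only builds the prompts and does not call any model. The function build_grounded_prompts, the prompt wording, and the sample queries are all assumptions made for illustration.

```python
import random


def build_grounded_prompts(real_examples, n_prompts, seeds_per_prompt=2, rng_seed=0):
    """Build generation prompts that each quote sampled real examples."""
    if len(real_examples) < seeds_per_prompt:
        raise ValueError("need at least seeds_per_prompt real examples")
    rng = random.Random(rng_seed)  # fixed seed keeps the test set reproducible
    prompts = []
    for _ in range(n_prompts):
        # Sample without replacement so a prompt never repeats a seed.
        seeds = rng.sample(real_examples, seeds_per_prompt)
        bullet_list = "\n".join(f"- {s}" for s in seeds)
        prompts.append(
            "Here are real user queries from our logs:\n"
            f"{bullet_list}\n"
            "Write one new query in the same style and on a similar topic."
        )
    return prompts


# Hypothetical logged queries, for illustration only.
real_queries = [
    "How do I reset my password?",
    "Can I get a refund after 30 days?",
    "Why was my card charged twice?",
]
prompts = build_grounded_prompts(real_queries, n_prompts=4)
```

Each prompt would then be sent to whichever model generates the synthetic cases, and the results should still be reviewed against real traffic before they are used as a test set.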

Finally, the temptation to automate too much of the AI development process can lead to significant issues. While LLMs can assist with certain tasks, they cannot replace the nuanced understanding that comes from directly engaging with the data. The root cause of many pitfalls in AI development is the neglect of fundamental data science principles. By embracing these principles, we can ensure the continued relevance and impact of data science in the AI era.

Key Concepts

Data Science

Data science involves using scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, data analysis, and machine learning to understand and analyze actual phenomena with data.

Machine Learning

Machine learning is a subset of artificial intelligence that involves the use of algorithms and statistical models to enable computers to improve their performance on a specific task through experience, without being explicitly programmed.

Category

Technology

Summarized by Mente
