Transforming the Linux Kernel History into a SQL Database with pgit
By Oliver Seifert

AI Summary
I imported the entire Linux kernel history into pgit, a tool that stores Git repositories in PostgreSQL and makes them queryable with SQL. The import covered 1,428,882 commits and 24.4 million file versions spanning 20 years of development, which delta compression reduced to 2.7 GB of actual data. It took two hours on a dedicated server with an AMD EPYC 7401P processor and 512 GB of RAM.
## Importing the Linux Kernel
The Linux kernel is one of the largest actively developed repositories, with 171,000 files and contributions from 38,000 developers. While other version control systems struggled with its size, pgit managed the import efficiently. The server setup included RAID 0 for maximum throughput and various OS tunings to optimize performance.
## Analyzing the Data
With the kernel history in a SQL database, I could query it for patterns. For instance, 38,506 authors contributed to the kernel, but only 1,540 had merge authority, which highlights the hierarchical nature of contributions. Most commits (90%) affected five files or fewer, consistent with the kernel's 'one logical change per commit' rule.
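The "five files or fewer" figure is the kind of result a single aggregate query can produce. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `commit_files` table, its columns, and the sample rows are hypothetical rather than pgit's actual schema.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (commit, changed file).
# pgit's real tables and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commit_files (commit_id TEXT, path TEXT)")
rows = [
    ("c1", "a.c"),
    ("c2", "a.c"), ("c2", "b.c"),
    ("c3", "a.c"), ("c3", "b.c"), ("c3", "c.c"),
    ("c3", "d.c"), ("c3", "e.c"), ("c3", "f.c"),
    ("c4", "g.c"),
]
conn.executemany("INSERT INTO commit_files VALUES (?, ?)", rows)

# Count files per commit, then compute the share of commits with <= 5 files.
SMALL_COMMIT_SHARE = """
WITH per_commit AS (
    SELECT commit_id, COUNT(*) AS n_files
    FROM commit_files
    GROUP BY commit_id
)
SELECT 100.0 * SUM(CASE WHEN n_files <= 5 THEN 1 ELSE 0 END) / COUNT(*)
FROM per_commit
"""
small_share = conn.execute(SMALL_COMMIT_SHARE).fetchone()[0]
print(f"{small_share:.1f}% of commits touch five files or fewer")
```

On the toy data, three of the four commits touch five files or fewer. The same query shape should carry over to PostgreSQL largely unchanged, since it relies only on a common table expression and standard aggregates.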
## File Coupling and Commit Patterns
Using pgit's analysis capabilities, I discovered patterns of file coupling, such as the frequent co-changes between intel_drv.h and intel_display.c in the Intel i915 GPU driver. The analysis also revealed that three people were responsible for merging 22.5% of all commits, with David S. Miller being the busiest.
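File coupling can be approximated by counting how often two files appear in the same commit. The following is a minimal sketch of that idea on toy data, not pgit's own analysis; the function name `co_change_counts` and the sample commits are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of files changes in the same commit."""
    pairs = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy input: each inner list is the set of files touched by one commit.
sample_commits = [
    ["intel_drv.h", "intel_display.c"],
    ["intel_drv.h", "intel_display.c", "i915_drv.c"],
    ["i915_drv.c"],
    ["intel_display.c", "intel_drv.h"],
]
coupling = co_change_counts(sample_commits)
top_pair, top_count = coupling.most_common(1)[0]
print(top_pair, top_count)
```

Pair counting grows quadratically with the number of files in a commit, so a real analysis would probably skip or cap very large commits such as tree-wide renames.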
## Corporate Contributions
Intel led in contributions, followed by Red Hat, which had the most productive team. The kernel's development has become more corporate over time, with individual contributors peaking at 12% in 2010 and declining to 8% by 2025.
## Bug Fixes and Profanity
The 'Fixes:' tag in commit messages has become more prevalent, with the most-referenced commit being Linus Torvalds's initial git import. Interestingly, only seven instances of profanity appeared in commit messages, all from two individuals, while the source code contained more colorful language.
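Finding the most-referenced commit amounts to extracting the hash from each `Fixes:` line and counting occurrences. The sketch below assumes the usual kernel convention of `Fixes: <abbreviated hash> ("subject")`; the regular expression, the helper `most_referenced`, and the sample messages are illustrative only.

```python
import re
from collections import Counter

# Matches a "Fixes:" trailer at the start of a line, followed by a hex hash.
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{7,40})\b", re.MULTILINE | re.IGNORECASE)


def most_referenced(messages):
    """Return (hash_prefix, count) pairs, most-referenced first."""
    refs = Counter()
    for msg in messages:
        for sha in FIXES_RE.findall(msg):
            # Normalise to a 12-character prefix so long and short forms of
            # the same hash are counted together (shorter hashes stay as-is).
            refs[sha.lower()[:12]] += 1
    return refs.most_common()


sample_messages = [
    'net: fix foo\n\nFixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")\nSigned-off-by: A <a@example.org>',
    'mm: fix bar\n\nFixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")',
    'fs: fix baz\n\nFixes: abcdef123456 ("fs: add baz")',
    "docs: typo, no tag",
]
ranking = most_referenced(sample_messages)
print(ranking[0])
```

A production version would resolve each prefix against the commit table instead of comparing strings, because abbreviated hashes of different lengths can refer to the same commit.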
## Unique Developer Stories
The analysis highlighted unique developer contributions, such as Kent Overstreet's 13-year journey to merge bcachefs into the mainline kernel and the dedication of weekend warriors who contribute significantly in their free time.
## Query Performance
The SQL queries on this vast dataset were impressively fast, with most completing in under a minute. This demonstrates the power of pgit in making large-scale code history easily accessible and analyzable.
Overall, this project not only showcased pgit's capabilities but also provided a deeper understanding of the Linux kernel's development dynamics.
Key Concepts
Delta compression is a method of storing data by recording only the differences between successive versions of the data, rather than the entire data set. This technique is commonly used to save storage space and improve efficiency.
A SQL database is a structured collection of data that is managed and queried using Structured Query Language (SQL). It allows for efficient data retrieval, manipulation, and storage.
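The delta compression idea defined above can be illustrated with a generic line-based delta: a new version is stored as "copy these lines from the old version" plus the few lines that are actually new. This is a sketch built on Python's `difflib`, not pgit's storage format, which the summary does not describe; `make_delta` and `apply_delta` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Encode `new` as copy/insert instructions relative to `old`."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))  # store the new lines literally
        # "delete" needs no instruction: those old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new version from the old version and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


v1 = ["int main(void)", "{", "    return 0;", "}"]
v2 = ["int main(void)", "{", '    puts("hi");', "    return 0;", "}"]

delta = make_delta(v1, v2)
literal_lines = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(v1, delta)
print(f"stored {literal_lines} literal line(s) instead of {len(v2)}")
```

Only the literal lines need to be stored alongside the copy ranges, which is why a history with many small edits to the same files compresses so well.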
Category
Technology
Original source
https://oseifert.ch/blog/linux-kernel-pgit