InfoQ

How LinkedIn Identified a Kernel Lock Contention Issue Causing Recurring System Freezes

TL;DR · AI Summary

LinkedIn traced recurring 10-15 second database freezes to kernel lock contention on mmap_lock by capturing off-CPU profiles with eBPF, then removed the trigger by pre-allocating a large Rust HashMap.

Key points

  • Conventional logs and metrics gave no hint of the root cause, so engineers wrote a script that captured an eBPF off-CPU profile the moment a freeze was detected.
  • The freezes came from a roughly 3.5 GB allocation, triggered by a HashMap resize, that held mmap_lock in write mode and blocked all other threads.
  • Pre-allocating the HashMap resolved the issue at the cost of about 3 GB of additional resident memory at startup.

#LinkedIn#Kernel Lock Contention#System Stability#Performance Monitoring
May 27, 2026 2 min read

by Sergio De Simone

When LinkedIn engineers encountered short-lived, recurring outages in which the database powering their user feed became unavailable and then recovered without leaving helpful traces, they had to devise a novel approach to uncover the root cause using _off-CPU profiling_ with eBPF.

As LinkedIn engineer Pratikmohan Srivastav explains, investigating those incidents was especially challenging because they were ephemeral, lasting only 10-15 seconds, and left no useful logs. Additionally, they recurred with no clear pattern and showed no clear external trigger.

A first clue emerged by correlating the incidents with the system memory behavior, which showed that each event coincided with a momentary spike in memory allocation, quickly resolved with the system stabilizing at a higher baseline. Further analysis ruled out other common causes, including CPU throttling, memory fragmentation and compaction, and file I/O.

Thus, the analysis based on conventional monitoring and metrics provided no hints about the root cause of the issue, which prompted LinkedIn engineers to dig deeper into OS and runtime-level behavior during the freezes. They turned to off-CPU profiling to understand which threads were blocked at the time.

As the LinkedIn team describes it: "Our solution was to build a trap. We wrote a monitoring script that would automatically capture an off-CPU profile the instant a freeze was detected."

The script relied on the eBPF toolkit BCC: it continuously monitored database health and, on detecting a freeze, immediately triggered the BCC offcputime.py profiler to record kernel stack traces of blocked or sleeping threads for 15 seconds. This allowed LinkedIn engineers to capture an off-CPU profile during a live freeze.
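A trap of this kind could be sketched as follows. This is a minimal illustration, not LinkedIn's script, which the article does not reproduce. The install path in `OFFCPUTIME`, the `is_healthy` callback, and the polling interval are all assumptions. The `-K` (kernel stacks only), `-f` (folded output), `-p` (process ID) and trailing duration arguments are standard BCC `offcputime` options.

```python
import subprocess
import time
from typing import Callable, List, Optional

# Assumed install path; BCC tools often live here, but it varies by distribution.
OFFCPUTIME = "/usr/share/bcc/tools/offcputime"


def build_profile_command(pid: int, seconds: int = 15) -> List[str]:
    """Build an offcputime call: kernel stacks only (-K), folded output (-f)."""
    return [OFFCPUTIME, "-K", "-f", "-p", str(pid), str(seconds)]


def run_trap(
    is_healthy: Callable[[], bool],
    pid: int,
    run: Callable[[List[str]], object] = subprocess.run,
    poll_interval: float = 0.5,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll the health check and capture one profile on the first failure.

    Returns True if a profile was captured, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if not is_healthy():
            # The freeze is in progress: start profiling immediately.
            run(build_profile_command(pid))
            return True
        polls += 1
        time.sleep(poll_interval)
    return False
```

In real use, `is_healthy` would be a hypothetical cheap probe against the database (for example a query with a short timeout), and the script would need root privileges to load eBPF programs. Passing `run` as a parameter keeps the loop testable without BCC installed.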

This was the key breakthrough - these events were too brief for conventional monitoring to capture the underlying cause, so the only way to observe the root cause was to have profiling instrumentation already in place when the freeze began.

The root cause was traced to a huge memory allocation of around 3.5 GB, during which the kernel held the process's mmap_lock semaphore in write mode, effectively blocking all other threads.
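One way to narrow a captured profile down to a lock like this is to total the blocked time attributed to each stack frame. The article does not say how LinkedIn analysed its profile, so the sketch below is only an illustration. It assumes the profile was saved in the folded format that `offcputime -f` emits, where each line is `thread_name;frame;frame;... microseconds`. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def blocked_time_by_frame(folded_lines: Iterable[str]) -> Dict[str, int]:
    """Sum off-CPU microseconds for every frame that appears in a folded stack."""
    totals: Dict[str, int] = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The value is the last space-separated field on the line.
        stack, _, value = line.rpartition(" ")
        # The first field is the thread name; the rest are stack frames.
        frames = stack.split(";")[1:]
        for frame in set(frames):  # count each frame once per stack
            totals[frame] += int(value)
    return dict(totals)
```

Sorting the result by value puts the frames shared by the most blocked time at the top. When many threads wait on the same lock, the functions that acquire that lock should rank high.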

Any operation that modifies the process's virtual address space - such as a large mmap allocation - must hold this lock in write mode. While the write lock is held, all other threads that need any memory operation (including madvise for purging, and page fault handling for I/O) are blocked.

Further analysis revealed that the allocation was triggered by a Rust in-memory HashMap (pkey_vs_docref), which maps primary keys to internal document references. When it grew past 58,720,256 entries, it hit a resize threshold and doubled in size.
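The threshold is not arbitrary. The arithmetic below is a sketch that assumes the map behaves like Rust's standard HashMap, which is built on hashbrown and, for tables of eight or more buckets, uses a power-of-two bucket count filled to at most 7/8. The article does not state these details, but under that assumption 58,720,256 is exactly the capacity of a table with 2^26 buckets.

```python
def capacity_for_buckets(buckets: int) -> int:
    """Max entries before a resize, assuming a 7/8 maximum load factor."""
    return buckets // 8 * 7


def buckets_for_entries(entries: int) -> int:
    """Smallest power-of-two bucket count (>= 8) whose capacity holds `entries`."""
    buckets = 8
    while capacity_for_buckets(buckets) < entries:
        buckets *= 2
    return buckets


THRESHOLD = 58_720_256  # entry count quoted in the article

buckets_before = buckets_for_entries(THRESHOLD)
buckets_after = buckets_before * 2  # one more insert forces a doubling

print(buckets_before == 2**26)                            # True
print(capacity_for_buckets(buckets_before) == THRESHOLD)  # True
print(capacity_for_buckets(buckets_after))                # 117440512
```

Inserting one entry beyond the threshold therefore requires a fresh table with 2^27 buckets, allocated in one piece while the old table is still live. That is consistent with a single multi-gigabyte allocation.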

Once the root cause was identified, LinkedIn engineers quickly resolved the issue by pre-allocating the HashMap, thus preventing the resizing during operation. This came at the cost of an additional ~3 GB of resident memory at startup, which proved to be an acceptable trade-off.

This incident highlighted several important lessons, Srivastav says: pre-allocating large data structures can help prevent sudden memory spikes in latency-sensitive paths; eBPF-based off-CPU profiling is a powerful tool for diagnosing “silent freezes” that leave little to no trace; and for ephemeral issues, automated instrumentation that activates on failure conditions can be essential for capturing meaningful diagnostics when the problem occurs.

About the Author

#### Sergio De Simone

Sergio De Simone is a software engineer. Sergio has been working as a software engineer for over twenty-five years across a range of different projects and companies, including such different work environments as Siemens, HP, and small startups. For the last 10+ years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and macOS development.

#### This content is in the Monitoring topic
