Managed Services Incident Response During CrowdStrike Outage

  • July 29, 2024

Background

Over the weekend, a major building products manufacturer experienced a critical system outage caused by the widespread CrowdStrike issue that impacted virtual machines (VMs) globally. Because the manufacturer was covered by Magic Software’s managed services, our dedicated team in India managed and resolved the crisis while it was still night in the US.

The Incident

At approximately 10:27 AM IST, the proactive monitoring system at Magic Software detected an outage alert from one of the manufacturer’s VMs. The affected VM caused the application it hosted to become unavailable, halting critical operations on the factory floor, which runs 24/7.

Immediate Response

Alert Detection: The monitoring system flagged the unavailability, triggering immediate action from the Magic Software team.

Rapid Coordination: Within minutes, the Indian support team coordinated with the DevOps and Infrastructure teams to identify the cause. It was quickly determined that the outage was due to a faulty CrowdStrike update, which left the VMs unavailable.

Technical Challenges: The VMs would not boot normally or in safe mode, which complicated the cleanup of the problematic CrowdStrike patches.

Resolution Steps

Main Plan: The team took a fast approach: detaching the affected disks, cleaning up the patches, and then reattaching the disks to the VMs. This method bypassed the boot failures caused by the CrowdStrike agent files.

Backup Plan: In parallel, the team restored a clean copy of the affected VM from the backup system and kept it ready as a fallback. Although it was not needed, this step reflected the team’s thorough preparedness.

Communication: Throughout the process, the Customer Success Manager maintained regular communication with the manufacturer’s representatives, providing updates every 30 minutes to keep them informed and reassured. In parallel, the Inbound Customer Success Manager updated the Outbound Customer Success Manager every 30 minutes so that customer-facing updates stayed current.
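The cleanup step in the main plan can be illustrated with a minimal sketch. It is not the script the team used. It assumes the detached system disk has been attached to a helper machine and mounted at a hypothetical path, and it uses the file pattern from CrowdStrike's publicly documented workaround (`C-00000291*.sys` in the `Windows\System32\drivers\CrowdStrike` folder), which this case study does not itself name. The function names and the `dry_run` flag are illustrative.

```python
from pathlib import Path

# Assumption: pattern and folder come from CrowdStrike's public remediation
# guidance for the July 2024 incident, not from this case study.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")


def find_faulty_channel_files(mount_root: Path) -> list[Path]:
    """Return matching channel files under a mounted Windows system disk."""
    target_dir = mount_root / CROWDSTRIKE_DIR
    if not target_dir.is_dir():
        # Nothing to clean if the agent folder is absent on this disk.
        return []
    return sorted(p for p in target_dir.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def clean_disk(mount_root: Path, dry_run: bool = True) -> list[Path]:
    """List matching files; delete them only when dry_run is False."""
    matches = find_faulty_channel_files(mount_root)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `clean_disk(Path("/mnt/affected-disk"))` (a hypothetical mount point) only lists the matching files. Passing `dry_run=False` deletes them, after which the disk can be reattached to its original VM. Defaulting to a dry run is a deliberate choice here, since the operation modifies a production system disk.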

Outcome

Within 5 to 6 hours, the Magic Software team restored the affected systems, keeping downtime to a minimum and making them available for the customer to resume work. Factory floor operations returned to normal, with continuous monitoring to confirm stability.

Impact

Operational Continuity: The swift response and resolution ensured that the manufacturer’s production, reliant on continuous data flow, experienced minimal disruption.

Cost Savings: Preventing prolonged downtime saved the manufacturer significant potential losses in production and operational costs.

Client Confidence: The transparency and efficiency demonstrated by Magic Software reinforced the manufacturer’s trust in the managed services provided.

Conclusion

This incident underscores the critical value of Magic Software’s managed services. The combination of proactive monitoring, technical expertise, and effective communication ensured that a potentially catastrophic event was managed with minimal impact. This case exemplifies the robust support system and quick problem-solving capabilities that Magic Software offers to its clients.
