subject: Finding Root Cause In A Production Environment
No one likes application outages. End users are furious, IT departments are stressed, and line-of-business people are angry about losing money. Minimizing downtime is an important goal for any IT department, and yet outages often last for hours at a time. Why is it that so many organizations have such slow troubleshooting processes?
The problem is that the team doesn't know where to look. This is especially true if you don't have any application performance monitoring tools at your disposal. These organizations often take a carpet-bombing approach to troubleshooting application problems: they send an email to everyone remotely connected to the application, and then wait for someone to fess up.
That usually doesn't work, for a whole host of reasons that I'm not going to get into. The point is this: you need to go looking for problems yourself. If you don't have a web performance monitoring tool, this is going to be pretty difficult.
This is because you need deep diagnostics to find root cause, and you need them in production, not just in a debugger. With code-level diagnostics you can see the complete code execution path and timing of slow user requests in production. Locating the exact line of code responsible for a performance issue means Operations and Developers resolve outages faster: complete code visibility allows you to troubleshoot in minutes rather than days or weeks. Here are a few examples of problems that were found and solved using a Java performance monitoring tool:
1. Slow SQL Statement
Industry: Education
Pain: Key Business Transaction with 5 sec response times
Root Cause: Slow JDBC query with full-table scan
2. Slice of Death in Cassandra
Industry: SaaS Provider
Pain: Key Business Transaction with 2.5 sec response times
Root Cause: Slow Thrift query in Cassandra
3. Slow & Chatty Web Service Calls
Industry: Media
Pain: Several Business Transactions with 2.5 min response times
Root Cause: Excessive Web Service Invocation (5+ per transaction)
4. Extreme XML Processing
Industry: Retail/E-Commerce
Pain: Key Business Transaction with 17 sec response times
Root Cause: XML serialization over the wire
5. Mail Server Connectivity
Industry: Retail/E-Commerce
Pain: Key Business Transaction with 20 sec response times
Root Cause: Slow Mail Server Connectivity
6. Slow Security 3rd Party Framework
Industry: Education
Pain: All Business Transactions with > 3 sec response times
Root Cause: Slow 3rd party code
7. Excessive SQL Queries
Industry: Education
Pain: Key Business Transactions with 2 min response times
Root Cause: Thousands of SQL queries per transaction
8. Commit Happy
Industry: Retail/E-Commerce
Pain: Several Business Transactions with 25+ sec response times
Root Cause: Unnecessary use of commits and transaction management
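Several of the root causes above (excessive SQL queries, chatty web service calls, one slow query) become visible once you count and time the calls a single transaction makes. Here is a rough Java sketch of that idea. It is not how any particular monitoring product works: the class name `TransactionTrace`, its methods, and the "SQL" and "WS" categories are all hypothetical, and the durations in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-transaction trace: records each outbound call and its duration. */
final class TransactionTrace {

    /** One recorded call: a category such as "SQL" or "WS", a label, and elapsed milliseconds. */
    static final class Call {
        final String category;
        final String label;
        final long millis;

        Call(String category, String label, long millis) {
            this.category = category;
            this.label = label;
            this.millis = millis;
        }
    }

    private final List<Call> calls = new ArrayList<>();

    /** Record a call whose duration was measured elsewhere (for example, around a JDBC execute). */
    void record(String category, String label, long millis) {
        calls.add(new Call(category, label, millis));
    }

    /** Total time spent across all recorded calls. */
    long totalMillis() {
        long total = 0;
        for (Call c : calls) total += c.millis;
        return total;
    }

    /** The single slowest call, or null if nothing was recorded. */
    Call slowest() {
        Call worst = null;
        for (Call c : calls) {
            if (worst == null || c.millis > worst.millis) worst = c;
        }
        return worst;
    }

    /** Number of calls per category, in first-seen order. */
    Map<String, Integer> countsByCategory() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Call c : calls) counts.merge(c.category, 1, Integer::sum);
        return counts;
    }

    /** Categories whose call count exceeds the threshold, i.e. "chatty" patterns. */
    List<String> chattyCategories(int maxCallsPerTransaction) {
        List<String> chatty = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countsByCategory().entrySet()) {
            if (e.getValue() > maxCallsPerTransaction) chatty.add(e.getKey());
        }
        return chatty;
    }

    public static void main(String[] args) {
        TransactionTrace trace = new TransactionTrace();
        // Made-up durations, for illustration only.
        trace.record("SQL", "SELECT * FROM orders", 4200);
        for (int i = 0; i < 6; i++) trace.record("WS", "getPrice", 40);
        System.out.println("slowest: " + trace.slowest().label);
        System.out.println("chatty: " + trace.chattyCategories(5));
    }
}
```

Monitoring tools typically collect this kind of data automatically rather than through hand-written `record` calls. The sketch only shows why per-transaction counts and timings point so directly at problems like a single slow query or too many web service invocations.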
If you want to manage and troubleshoot application performance in production, you need to be able to get to root cause. The only way to get to root cause quickly and effectively is with a performance monitoring tool. If you're not using one yet, you're probably in unnecessary pain.