Yeah, get back to me when you have data...


User: “Hey, my SQL queries are timing out on the replica!”

J: “Hmmm, that error says they’re so slow that the master is trying to clean up all the rows that have changed since it started running.”

U: “Can we try it on the primary?”

J: “No, because 1) Production, and 2) error is specific to replicas.”

U: (CCs Partner)

Partner: “Our service that queries your DB, sends the results over a VPN tunnel, and ingests them into our system is working fine, and doesn’t show any delays or network issues.”

J: “That’s what you said last time, and the problem ‘just went away’ the next night.”

P: “Try running this specific query locally and tell me how it works.”

J: (examines 450-line SQL query, shrugs, runs it) “2.5 minutes.”

P: “Hmm, works here with ‘limit 10000’ and fails with ‘limit 50000’. Well, nothing we can do on our end! Shall I set up a call with our Engineering team?”

J: “Wait, what was the runtime for that query on your end before it started timing out this week? For that matter, what was the runtime when it worked with ‘limit 10000’?”

[update]

P: “Here’s a chart showing it bang-on at almost that exact same runtime for weeks, until it started timing out every single time on Friday night.”

J: “Okay, let us know when you’ve fixed that. Just for fun, try changing the query to just return the count of rows instead of the ~36 MB of data. I, um, have a hunch.”

[update]

U: “9.6 seconds.”

J: “So, you can successfully submit obnoxious queries through Partner’s interface, as long as they don’t return any significant quantity of data. Hmmm, what does that sound like?”

[update]

(long meeting full of fingerpointing with no indication of how it started failing 100% of the time like throwing a switch)

J: “I have finally managed to reproduce the failure locally, which means I’m willing to try a small config-file change to work around the problem. Reminder: we still have no hint as to the actual cause.”

[you are here]


Comments via Isso

Markdown formatting and simple HTML accepted.

Sometimes you have to double-click to enter text in the form (interaction between Isso and Bootstrap?). Tab is more reliable.