A Day Gone Wrong and Right
Yesterday was some goofy day. I had 2 days of HF stuff I was behind on. Lots of PMs, emails, and orders to catch up with. I had a good night of sleep though and was ready to take on the day. Within 10 mins of turning on the computer the wife walks in asking if I wanted to go out for breakfast.
A smart man doesn’t say no to his wife very often and besides that I could use some grub. I figured I’d be back in 90 mins or less.
So we headed out and all went well but I did get an afternoon Bloody Mary and that was delicious. Not much alcohol really which was fine.
Oh I forgot to mention that the “good night of sleep” meant I slept till about noon. Sometimes you just gotta pull the covers over your head when the sun comes up and keep on sleeping.
So breakfast was great and we headed out but then the wife says let’s get some groceries. I sighed and tried to just go home but truth is we needed some milk and something for dinner. So off we went to get groceries.
After we got our few bags of goodies it’s nearly 3pm which means it’s time to get my kids from school. And that’s exactly what we did next.
Oh another thing I forgot to mention. Soon as I woke up I made a whole pot of coffee and this entire time it’s just sitting there waiting for me. I love my coffee and the days I drink a pot are the days I do lots of work. I was preparing to do a heavy load.
Nearly 4pm I'm finally at home. I grabbed my coffee, sat down, and began to see what was in front of me to do.
Apparently the site was having some intermittent downtime, and the first thing that caught my eye was a thread about Cloudflare problems. Now I have monitors all over the place and sure enough, something did appear amiss. Uptime is high on the priority list, so I went to take a look.
Most things seemed okay, but over the past week or so a few nagging DDoS attacks had caused minor disruptions. These are annoying to everyone and, to be honest, I could do something about it. So I did. I enabled a few modules I have running and made some adjustments to rules. They immediately caught a few IPs working past the DDoS protection at CF. “Excellent”, I thought.
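For the curious, spotting those IPs doesn't take anything fancy. Here's a rough sketch of the idea, with a made-up log file and made-up addresses, not our actual setup or rules:

```shell
# Illustration only: a fabricated access-log snippet in the common combined style.
cat > /tmp/access_log <<'EOF'
203.0.113.9 - - [12/Mar/2013:15:02:01] "GET / HTTP/1.1" 200 512
203.0.113.9 - - [12/Mar/2013:15:02:02] "GET / HTTP/1.1" 200 512
203.0.113.9 - - [12/Mar/2013:15:02:03] "GET / HTTP/1.1" 200 512
198.51.100.4 - - [12/Mar/2013:15:02:05] "GET /forum HTTP/1.1" 200 2048
EOF

# Count requests per IP; an address far above the rest is a candidate for a block rule.
awk '{print $1}' /tmp/access_log | sort | uniq -c | sort -rn | head -3
```

An IP hammering the origin directly like that is exactly the kind of thing that slips past upstream protection.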
While in SSH on our server I was reviewing some logs and noticed a seg fault error from Apache. Well, that shouldn't have been there. It kept popping up too. That needed investigation. And it's also where the day went wrong. And I bet you thought this story was about breakfast.
Apparently this seg fault error had been going on for some time and I just hadn't noticed it. A seg fault basically means a process crashed because it touched memory it shouldn't have, instead of being shut down gracefully. I had an A-HA moment, thinking I'd stumbled onto something that might be why serving hasn't been so amazing lately. It's been fine, but moments of server hiccups happen and a refresh is required to get the page to load. It happens and I wasn't overly concerned, but here I was sitting in front of the error. No proper sysadmin would ignore this situation. I had to fix it.
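If you've never hunted one of these down, it looks something like this in the error log. The log lines, pids, and path here are fabricated for illustration; the real format and location vary by distro and config:

```shell
# Illustration only: fabricated Apache error-log lines.
cat > /tmp/error_log <<'EOF'
[Tue Mar 12 15:02:11 2013] [notice] child pid 4312 exit signal Segmentation fault (11)
[Tue Mar 12 15:05:40 2013] [error] File does not exist: /var/www/favicon.ico
[Tue Mar 12 15:09:03 2013] [notice] child pid 4377 exit signal Segmentation fault (11)
EOF

# How many seg faults are in the log? A steady nonzero count is the red flag.
grep -c "Segmentation fault" /tmp/error_log
```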
Some quick Googling and a few possibilities were in front of me. One was hardware failure. Another was a script error. Lastly, it could be a bad Apache or PHP module.
I decided to work first with modules and packages, and proceeded to update everything I could on the server that might be the conflict. I'm really familiar with this stuff so it wasn't too bad. Just some compiles and installs and a few config alterations. Nothing that was a problem to handle. One module didn't work though, and for a short time the site displayed an Xcache error. That took me 15-20 mins to fix. But overall I got everything updated and working. The problem however didn't go away, and the error was still showing up in the logs.
Now it’s been a few hours and it’s getting later and later and this error is getting on my nerves now.
I was praying that we didn't have a hardware error, because those are actually harder to troubleshoot. You could have bad RAM, a bad drive sector, a faulty mobo, or even a failing CPU. So normally you end up having your datacenter just swap it all out, but a new drive would mean a new rebuild too. Not what I wanted to do anytime soon. But out of obligation I had to run some memory tests, which all passed just fine. I really didn't think the issue was hardware though, considering the error was very consistent. Wait a minute…hold on here. Let's take a peek at the error again.
The error actually coincided with our deny actions inside htaccess. I reviewed the server error log alongside the site error log and sure enough, they matched. This meant the deny process, or something to do with it, was causing a seg fault. I felt I was onto something now. I was finally narrowing this down after hours of typing away.
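The matching itself is nothing fancy: pull the timestamps out of each log and see which ones appear in both. Roughly like this, with fabricated logs and timestamps standing in for the real ones:

```shell
# Illustration only: fabricated logs. One records deny hits, one the seg faults.
cat > /tmp/deny.log <<'EOF'
15:02:11 deny 203.0.113.9
15:09:03 deny 198.51.100.4
EOF
cat > /tmp/segfault.log <<'EOF'
15:02:11 child exit signal Segmentation fault (11)
15:07:55 graceful restart
15:09:03 child exit signal Segmentation fault (11)
EOF

# Extract the timestamps from each log, then list the ones common to both.
awk '{print $1}' /tmp/deny.log | sort > /tmp/deny_times
grep "Segmentation fault" /tmp/segfault.log | awk '{print $1}' | sort > /tmp/sf_times
comm -12 /tmp/deny_times /tmp/sf_times
```

If every seg fault lines up with a deny hit, the deny path becomes the prime suspect.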
Opening the htaccess, it all appeared rather normal, but maybe there was one line causing problems. I'd actually had some weirdness with htaccess before, but I won't bother you with the details on that. I started to remove lines. No effect. I removed more lines. No effect. I removed just about everything. The error stops. Yup, that's right. The error stops. Something inside htaccess was causing a process to end in a seg fault.
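That remove-lines-and-retest routine is basically a manual bisect. Sketched out with a made-up three-line htaccess (the directives and IP are invented, not our real file):

```shell
# Illustration only: a tiny fabricated htaccess standing in for the real one.
printf 'RewriteEngine On\ndeny from 203.0.113.9\nOptions -Indexes\n' > /tmp/htaccess

# Split the file in half. Test with only the top half in place; if the seg
# faults continue, recurse into that half, otherwise into the bottom half,
# until a single line is left.
total=$(wc -l < /tmp/htaccess)
half=$(( (total + 1) / 2 ))
head -n "$half" /tmp/htaccess > /tmp/htaccess.top
tail -n +"$((half + 1))" /tmp/htaccess > /tmp/htaccess.bottom
```

Each round halves the suspects, so even a long htaccess narrows down in a handful of reloads.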
After hours updating my packages and modules, I was right back with Apache, reviewing conf files and trying to figure out if I had something I shouldn't have. Or maybe it was a setting I had goofed. I do so much custom stuff I wouldn't be surprised if I goofed it up somewhere. At least I knew I was onto something.
Surprisingly this entire time I’m fairly calm about the work. I drank my coffee. I was enjoying the challenge. But after many hours I was starting to get worn out. I closed my SSH session. I had some dinner but it was getting late. I felt I was done for the night and played some HackerCraft and answered some PMs.
This is where being a sysadmin gets into your blood though. It wasn't long after my short break before I had some ideas about the problem, and I opened SSH again. This time I was going to figure it out.
I toyed with the htaccess for a long time trying different ways to write the file and different options with Apache confs. I felt I was finally getting somewhere. Then around 1am I was confident I had the error nailed and eradicated. Our configs and setup were not only still intact but everything was updated. We were actually not only fixed but better than before I started. This is where being a sysadmin is rewarding. It’s a wonderful feeling to achieve.
But wait a second. Didn’t I still have two days of work piled up? Because even though I just busted my bum to get rid of an error and upgrade parts of the server I still have those PMs, emails, and orders to deal with. Boy I was very tempted to log off and get some sleep but actually the coffee I had earlier was still giving me some juice. Enough so that one by one I took care of it all. All orders were done. All the PMs were replied to. All the stickies taken care of. It wasn’t till 4am that I really got off the keyboard and tried to get some sleep. It was some day I tell you.
Waking up the next day my first priority was making sure the error was still gone. And it was. But I wanted to share my story with everyone because I think often when I have days like this it appears I’m not working hard or maybe I’m not concerned about HF. When in fact that’s the furthest from the truth. Even on a day I don’t post much. Or a day I’m not answering PMs. I’m still here working to improve our site.
Now I have 35 PMs to answer and some posts to make. Thanks for reading as usual and see you all around the forum.