|Meet the Gang 1 2 3 4 5 6 7 8 9|
There is no guarantee that your questions here will ever be answered. Readers at confidential sites must provide permission to publish. However, you can be published anonymously - just let us know!
Answered By Thomas Adam, Mike Ellis, Ben Okopnik, Huibert Alblas
My linux machine is crashing randomly once every couple of days - it freezes up and will not respond to anything (including ctrl-alt-del, or ping from another machine) except the on/off switch. The load on the machine is light, and the work it is doing is not particularly unusual.
1) Can anyone suggest how I could gather useful information about what is going on?
I put a line like this in /etc/syslog.conf:
As far as I understand it, this should get all possible debugging information out of syslogd, although I'm not completely clear whether any more could be squeezed out of klogd. In any case, I'm not getting any messages around the time of a crash. I've also turned on all the logging options that I can find in the processes that I am running, without any helpful effect.
[Thomas] Have you added any memory to your machine recently?? This has been known to "crash" machines randomly.
What programs do you have running by default? Perhaps you could send me (us) the output of the "pstree" command so that we can see which process is linked to what.
[Mike] Quite right, Thomas. If you have two or more memory modules (DIMMs probably) in your machine, try removing one of them if you can. If the fault appears to go away, try putting the module back in and see if the fault re-appears. If the fault never goes away, replace the first module, remove another, and try again.
As you're running a 2.4 kernel, make sure you have plenty of swap. Sadly the 2.4 kernels aren't as good as the older 2.2 at making maximum use of swap, with the result that you are now strongly recommended to... look at http://www.linuxgazette.com/issue62/lg_tips62.html#tips/12 if you need help. I haven't heard tales of this causing random lock-ups, but you never know!
[Halb] Yes, the early 2.4 kernels had 'some' trouble with swap space. But around the time of 2.4.9 a completely new (built from scratch) VM was introduced by Andrea Arcangeli, and it has been incorporated by Linus since 2.4.10.
- You can read a good story on:
It is an interesting, not too long story.
However, if you're using the new tmpfs, it might be wise to err on the side of generosity when allocating swap space. Using tmpfs, your /tmp (and/or /var/tmp or other designated directories) can be sharing space with your swap (kernel VM paging).
Still, one or two swap partitions of 127MB should be plenty for most situations. I still like to keep my swap partitions smaller than 127MB (the historical limit was 128, but cylinder boundaries usually round "up"). I also recommend putting one swap partition on each physical drive (spindle) to allow the kernel to balance the load across them (small performance gain, but negligible cost on modern hard disks).
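One way to express the one-swap-partition-per-drive layout is in /etc/fstab, giving both areas the same priority so that the kernel allocates pages from them in round-robin fashion. This is only a sketch: the device names /dev/hda2 and /dev/hdc2 and the priority value 1 are assumptions for illustration, not details from the querent's machine.

```
# /etc/fstab fragment (hypothetical devices: one swap partition per physical drive)
# Equal pri= values let the kernel spread paging across both spindles.
/dev/hda2   swap   swap   defaults,pri=1   0 0
/dev/hdc2   swap   swap   defaults,pri=1   0 0
```

After editing the file, "swapon -a" activates the listed areas and /proc/swaps shows the priority each one actually received.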
2) If I can get any usable information about the problem, does anyone know where I should send it?
[Thomas] Here, to both me and the rest of TAG.
If I knew that it was a kernel problem, I'd try the linux-kernel mailing list. But that looks pretty intimidating, so I'd want to be sure I knew what I was talking about first! Also, I guess that some kind of hardware problem is more likely.
[Thomas] I'm still hedging my bets on memory... if it is a kernel problem then you could try to re-compile it using the latest stable release.
I'm using Red Hat 7.2, which includes the 2.4.7-10 kernel, on a machine with an Intel Pentium 4 CPU running at 1.5 GHz and 512M of RAM. Crashes occur even when I am not running X and no users are logged on. The main process that I am running is the Jakarta Tomcat web server, which runs a Java servlet, which runs the symbolic mathematics program Maple as an external process. As far as I can tell from the logs, when the last crash occurred, there had been no request to the web server for some time. It's just possible that a request triggered the crash, which prevented the request from being logged, but I doubt it.
Thanks in advance for any suggestions.
[Thomas] I might also suggest that you run the "strace" command on processes you think might be crashing. That will then tell you where and how... if nothing else.
[Ben] I'm pretty much of the same mind as Thomas on this one; Linux is pretty much bullet-proof, what tends to cause crashes of this sort is hardware - and that critical path doesn't include too many things, particularly when the key word is "random". Memory would be the first thing I'd suspect (and would test by replacement); the hard drive would be the second. I've heard of wonky motherboards causing problems, but have never experienced it myself. I've seen a power supply cause funky behavior before - even though that was on a non-Linux system, it would be much the same - and... that's pretty much it.
"strace", in my opinion, is not something you can run on a production system. It's great for troubleshooting, but running a web server under it? I just tried running "thttpd" under it, and it took approximately 30 seconds just to connect to the localhost - and about 15 more to cd into a directory. Not feasible.
[Thomas] Hum, perhaps I wasn't too clear on that point. What I meant was that he should run strace on only one process which he thinks might be causing the crash. Hence the reason why I initially asked for his "pstree" output.
But I agree, strace is not that good when trying to analyse a "labour intensive" program such as a webserver, but then I fail to see why one would want to run "strace" on such a program anyway... after all, Apache is stable enough.
Thanks again for all your help.
[Mike & Ben] You're welcome.
Memory would be the first thing I'd suspect (and would test by replacement);
I downloaded memtest86 (from http://www.teresaudio.com/memtest86) and ran through its default tests twice (that took about 40 minutes - I haven't yet tried the additional tests, which are supposed to take four or five hours, altogether). Nothing came up. Do you think that's reliable, or would you test by replacement anyway?
[Mike] The problem may be an intermittent fault: if the tests take 40 minutes and the machine usually runs for (say) 4 days, you've effectively given it less than a 1% chance of finding the problem [40/(4*24*60)]. I'd still seriously consider a test by replacement and/or removal of DIMMs.
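Mike's back-of-the-envelope figure can be reproduced in a few lines of shell. This is a rough sketch, assuming (as he does) a typical interval of four days between crashes; the variable names are made up for the example.

```shell
#!/bin/sh
# What share of a typical crash interval does one memtest86 session cover?
test_minutes=40      # two default passes, as reported in the question
interval_days=4      # assumed typical time between crashes ("say 4 days")

interval_minutes=$((interval_days * 24 * 60))

# The shell only does integer arithmetic, so let awk compute the percentage.
coverage=$(awk -v t="$test_minutes" -v i="$interval_minutes" \
    'BEGIN { printf "%.2f", 100 * t / i }')

echo "interval: ${interval_minutes} min, coverage: ${coverage}%"
```

With these inputs the interval is 5760 minutes and the coverage comes out at 0.69%, which is the "less than 1%" quoted above.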
[Ben] My rule of memory testing, for many years now, has been "a minimum of 24 hours - 48 is better - and hit it with freeze spray at the end." For a system that needs to be up and running, however, "shotgunning" (wholesale replacement of suspect hardware) is what offers the highest chance of quick resolution.
the hard drive would be the second
I've seen a power supply cause funky behavior before
These don't sound like easy things to test. Do you have any suggestions?
[Mike] They aren't, sadly. Testing by replacement is really the best option for these sorts of problems, but beware, we had a machine here with a dodgy PSU recently which cost us a lot more than a new PSU )-: By the time we'd tracked down the problem we had...
- three suspect hard-drives
- two suspect 128M DIMMs
- two suspect motherboards
- two suspect PIII processors
- one suspect network card
- one suspect video card
- one suspect CD-ROM drive
- one suspect floppy drive
- one suspect keyboard
- one suspect mouse
- and a partridge in a pear tree
The whole lot had to be disposed of because we had used the faulty PSU with them, and the fault was that it generated occasional over-volt spikes during power-up. These potentially weakened any or all of the other components in the system rendering them unsuitable for mission-critical applications (we actually purchased a cheap case, marked all the bits as suspect and built them into a gash machine for playing with).
In your case, try cloning the hard-drive and replacing that. You can use dd to clone the drive - dd if=/dev/current_hard_disc of=/dev/new_hard_disc bs=4096 - assuming the hard-drives are the same size. Don't use the partitions, though - /dev/hda and /dev/hdc will work, /dev/hda1 and /dev/hdc1 won't since the partition table and MBR won't be copied. Using the raw devices will also copy any other partitions if you've got them.
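Since dd will cheerfully overwrite whatever "of=" points at, it may be worth rehearsing the procedure on scratch files first. The sketch below does that, assuming two temporary image files as stand-ins for /dev/current_hard_disc and /dev/new_hard_disc; the file names and the 1 MiB size are invented for the example.

```shell
#!/bin/sh
# Rehearse whole-device cloning with dd on scratch files instead of real discs.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT    # remove the scratch files when the shell exits

src="$workdir/source.img"        # stand-in for /dev/current_hard_disc
dst="$workdir/clone.img"         # stand-in for /dev/new_hard_disc

# Build a 1 MiB "disc" filled with random data (256 blocks of 4096 bytes).
dd if=/dev/urandom of="$src" bs=4096 count=256 2>/dev/null

# Same invocation as in the text: whole device in, whole device out.
dd if="$src" of="$dst" bs=4096 2>/dev/null

# Compare the copy byte for byte; cmp exits non-zero on any difference.
if cmp -s "$src" "$dst"; then
    clone_status=identical
else
    clone_status=different
fi
echo "clone is $clone_status"
```

The same cmp check can be run against the real devices after cloning, provided neither disc is mounted read-write while it runs.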
<Ding/> One bright idea that has just occurred to me - are you using any external devices? If, for example, you've got an external SCSI scanner on the same chain as your internal SCSI discs, a dodgy connection or termination could potentially cause random crashes. It might also be worthwhile checking any USB or FireWire devices you've got connected. I doubt serial or parallel devices would cause a problem, but it might be worth checking just in case. Internal connections are also suspect - a CD-ROM drive on the same IDE chain as your boot disc might cause problems: you might even like to remove it completely if you don't use it often. Any PCI cards are also candidates for suspicion - make sure they're all plugged in fully.
Let us know how you get on!
[Ben] Unfortunately, all my best suggestions come down to the above two. I used to look for noise in power supply output with an oscilloscope - interestingly enough, it was a fairly reliable method of sussing out the problematic ones - but I suspect that it's not a common skill today. There are a number of HDD testers out there, all hiding behind the innocuous guise of disk performance measurement tools... but Professor Moriarty is not fooled.
Seriously, if running one of those (e.g., "bonnie++") for a few hours doesn't make your HDD fall over and lie there twitching, you're probably all right on that score.