Five Steps to Rescue a Linux Server That Collapsed

We have seen plenty of Linux servers that ran for years in 24×7 mode without a single reboot. But no computer is immune to surprises: hardware, software, and network failures all happen, and even the most reliable server can give out one day. What then? This article covers what to do first to find the cause of the problem and get the machine back on track.

One question has to be answered right after the failure: is the server really to blame for what happened? The source of the problem may well lie somewhere else. But let's not get ahead of ourselves.

Troubleshooting: Before and Now

In the 1980s, long before Linus Torvalds came up with the idea of Linux, a misbehaving server was a real disaster. There were relatively few tools for tracking down problems, so getting a failed server working again could take a long time.

Now everything is completely different. A system administrator once told me, quite seriously, about a problem server: “I destroyed it and spun up a new one.”

In the old days this would have sounded wild, but today IT infrastructure is built on virtual machines and containers. Deploying new servers as needed is normal in any cloud environment.

  • Add DevOps tools such as Chef and Puppet, and it becomes easier to create a new server than to diagnose and “repair” the old one. With higher-level tools such as Docker Swarm, Mesosphere, and Kubernetes, a failed server's workload can be restored automatically before the administrator even finds out about the problem.
  • This concept has become so widespread that it has been given a name: “serverless computing.” Platforms that provide such features include AWS Lambda and Google Cloud Functions.
  • With this approach, the cloud service is responsible for administering servers, handling scaling, and a host of other tasks, in order to give the client the processing power required to run its applications.

Serverless computing, virtual machines, containers: all these abstraction layers hide real servers from users and, to some extent, from system administrators. Underneath it all, however, are physical hardware and operating systems. If something breaks at that level, someone has to put it right. That is why the subject of this article will never lose its relevance.

  • I remember talking to one system operator. Here is what he said about how to act after a failure: “Reinstalling the server is a road to nowhere. That way you never understand what happened to the machine or how to prevent it in the future. No decent administrator does that.” I agree. Until the source of the problem is found, the problem cannot be considered solved.

So, we have a server that has crashed, or at least we suspect that it is the source of the trouble. Here are five steps that are a good place to start finding and fixing the problem.

Step one. Hardware check

First of all, check the hardware. I know it sounds trivial and old-fashioned, but do it anyway. Get up from your chair, go to the server rack, and make sure the server is properly connected to everything it needs for normal operation.

I have lost count of how many times the search for a cause ended at a cable connection. One look at the LEDs and it is clear that the Ethernet cable has been pulled out or the server's power is off.

Of course, if everything looks more or less in order, you can skip the trip to the server and check the state of the Ethernet connection with this command:

$ sudo ethtool eth0

If the output includes “Link detected: yes”, the interface has a physical link and can exchange data over the network.
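When several interfaces or hosts need checking, the same test can be scripted. The following is a minimal sketch rather than a finished tool: link_state is a hypothetical helper name, and it assumes ethtool prints its usual “Link detected: yes/no” line.

```shell
# Hypothetical helper: read `ethtool <iface>` output on stdin and
# print the value of its "Link detected" field (normally yes or no).
link_state() {
  awk -F': *' '/Link detected/ { print $2 }'
}

# Usage (eth0 is only an example interface name):
#   sudo ethtool eth0 | link_state
```

Reading from stdin keeps the helper independent of how the ethtool output was obtained, for example over ssh from another machine.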

Still, do not skip a personal inspection of the machine. It can reveal, for example, that someone has pulled out an important cable and cut power to the server or the entire rack. Yes, it is ridiculously simple, but it is surprising how often this is exactly the reason for the failure.

Another common hardware problem cannot be spotted with the naked eye: faulty memory, which causes all sorts of problems.

Virtual machines and containers can hide these problems, but if failures keep recurring on a particular physical dedicated server, check its memory.

To see what the BIOS/UEFI reports about the computer's hardware, including memory, use the dmidecode command:

$ sudo dmidecode --type memory
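The dmidecode output is long, so it can help to condense it to one line per memory slot. The sketch below assumes the usual “Memory Device” blocks with Size, Locator, and Serial Number fields separated by blank lines; memory_slots is a hypothetical helper name, and field names may differ on some firmware.

```shell
# Hypothetical helper: read `dmidecode --type memory` output on stdin and
# print one line per memory slot: locator, reported size, serial number.
memory_slots() {
  awk -F': *' '
    /^[ \t]*Size:/          { size = $2 }
    /^[ \t]*Locator:/       { slot = $2 }   # does not match "Bank Locator:"
    /^[ \t]*Serial Number:/ { serial = $2 }
    /^$/                    { flush() }     # a blank line ends a block
    END                     { flush() }
    function flush() {
      if (slot != "") print slot ": " size " (serial " serial ")"
      slot = size = serial = ""
    }
  '
}

# Usage:
#   sudo dmidecode --type memory | memory_slots
```

Empty slots typically show up with a size such as “No Module Installed”, which makes them easy to tell apart from populated ones.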

Even if everything looks normal here, it may not be. SMBIOS data is not always accurate. So if memory is still under suspicion after dmidecode, it is time to use Memtest86. It is a great memory-testing program, but it is slow. If you run it on a server, do not expect to use the machine for anything else until the scan completes.

If you run into memory problems often (I have seen this in places with unstable power), load the Linux kernel module edac_core. This module continuously checks memory for errors. To load it, use this command:

$ sudo modprobe edac_core

Wait a while, then check whether anything has been recorded by running this command:

$ sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

This command gives you a summary of the error counts broken down by memory module (the entries whose names begin with csrow). Compared against the dmidecode data on memory channels, slots, and component serial numbers, this information helps identify the bad memory module.
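Most of those counters are normally zero, so it is convenient to show only the ones that are not. A minimal sketch, assuming the file:count lines that grep prints when it searches several files; edac_errors is a hypothetical helper name.

```shell
# Hypothetical helper: read "path:count" lines on stdin (as printed by the
# grep command above) and report only counters greater than zero.
edac_errors() {
  awk -F: '$NF + 0 > 0 { print $1 ": " $NF " corrected errors" }'
}

# Usage:
#   sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count | edac_errors
```

An empty result means no corrected errors have been counted since the module was loaded.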

Step two. Finding the true source of the problem

So, the server has started behaving strangely, but smoke is not coming out of it yet. Is the server really at fault? Before trying to solve the problem, you must first determine its source. For example, if users complain about odd behavior in a server application, first check that the cause is not a malfunction on the client.

For example, a friend once told me how his users reported that they could not work with IBM Tivoli Storage Manager. At first, of course, it looked as if the server was to blame. In the end, though, the administrator found that the problem had nothing to do with the server side at all. The cause was a failed Windows client patch, 3076895. The way that security update failed, however, made it look like a server-side problem.

You also need to work out whether the cause is the server itself or the server application. For example, a server program can misbehave while the hardware is in perfect order.

Start with the most obvious question: is the application running? There are many ways to verify this. Here are two of my favorites:

$ sudo ps -ef | grep apache2
$ sudo netstat -plunt | grep apache2

If it turns out that, for example, the Apache web server is not running, you can start it with this command:

$ sudo service apache2 start
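One caveat with the ps check is that `ps -ef | grep apache2` can also list the grep process itself, so a non-empty result does not prove the service is up. The sketch below counts exact command-name matches instead; count_procs is a hypothetical helper name, and apache2 is simply the example service used above.

```shell
# Hypothetical helper: read one command name per line on stdin
# (e.g. from `ps -e -o comm=`) and print how many equal the given name.
count_procs() {
  # -x: whole-line match, -F: fixed string, -c: print the count.
  # grep exits non-zero when the count is 0, hence the "|| true".
  grep -c -x -F -- "$1" || true
}

# Usage:
#   ps -e -o comm= | count_procs apache2
```

A result of 0 means no process with that exact name is running.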

In short, before you diagnose the server and look for the cause of the problem, find out whether the server is to blame or something else is. Only when you know exactly where the source of the failure lies can you ask the right questions and move on to further analysis.

Compare it to a car that suddenly stops. You know the car will not go any further, but before you take it in for service, it is worth checking whether there is petrol in the tank.

Step three. Using the top command

So, if all paths lead to the server, here is another important tool for examining the system: the top command. It shows the server's load average, swap usage, and which processes are consuming system resources. The utility displays general information about the system along with data on every running process on the Linux server, and its man page describes each field in detail. There is a lot here that can help in tracking down problems. Here are some useful ways to work with top that help you find problem areas.

To find the process consuming the most memory, sort the process list in interactive mode by pressing M. To find the application consuming the most CPU, sort by pressing P. To sort processes by accumulated CPU time, press T. To make the sort column easier to see, press x to highlight it (b toggles the highlight between bold and reverse video).
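For scripts or quick notes in a ticket, a similar ordering can be produced without the interactive screen. This is a sketch using ps rather than top itself: top_by_rss is a hypothetical helper name, and it assumes input lines of the form "PID RSS COMMAND".

```shell
# Hypothetical helper: read "PID RSS COMMAND" lines on stdin and print
# the N lines with the largest resident memory (default: 5).
top_by_rss() {
  sort -k2,2nr | head -n "${1:-5}"
}

# Usage (roughly what pressing M in top shows, as a one-off snapshot):
#   ps -e -o pid=,rss=,comm= | top_by_rss 10
```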

In addition, the process list in interactive mode can be filtered by pressing O or o. The following prompt appears, asking you to add a filter:

add filter #1 (ignoring case) as: [!]FLD?VAL

You can then enter a pattern, for example to filter on a specific process. With the filter COMMAND=apache, top displays only Apache processes.

Another useful feature of top is displaying each process's full path and startup arguments. To view this data, press the c key.

A related top feature is activated by the V key. It switches to a hierarchical (tree) view of the processes.

In addition, you can view the processes of a particular user with the u or U keys, or hide processes that are not consuming CPU resources by pressing the i key.

Although top has long been the most popular interactive Linux utility for viewing the current state of the system, it has alternatives. For example, htop offers an extended feature set and a simpler, more convenient ncurses interface. In htop you can use the mouse and scroll the process list vertically and horizontally to view the full list and complete command lines.

I do not expect top to tell me what the problem is. Rather, I use it to find something that makes me think “now that is interesting” and prompts further investigation. Based on what top shows, I know, for example, which logs to look at first. I read logs using combinations of less, grep, and tail -f.

Step four. Checking disk space

Even today, when you can carry a terabyte of data in your pocket, a server can run out of disk space without anyone noticing. When that happens, you can see very strange things.

The good old df command will help us with disk space. Its name is usually read as “disk free”, and it gives you a summary of the free and used space on each filesystem.

Usually, df is used in two ways.

$ sudo df -h

Displays data about the disks in human-readable form. For example, drive capacity is shown in gigabytes rather than as an exact number of bytes.

$ sudo df -i

Displays the number of used inodes and their percentage of the filesystem's total.

Another useful flag is -T, which displays the filesystem type of each volume. For example, the command $ sudo df -hT shows both the amount of disk space used and the type of each filesystem.
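To catch a filling disk before it causes those strange effects, the df output can be checked against a threshold. A minimal sketch, assuming the six-column POSIX layout printed by `df -P` and mount points without spaces; full_filesystems and the 90% default are illustrative choices, not recommendations from the original text.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above the given percentage.
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= limit) print $6, $5
    }
  '
}

# Usage:
#   df -P | full_filesystems 90
# On GNU df, `df -P -i` prints inode usage in the same column positions,
# so the same helper should work for inode exhaustion as well.
```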

If something looks strange, you can dig deeper with the iostat command. It is part of sysstat, an advanced set of system monitoring tools. It reports CPU statistics as well as I/O statistics for block storage devices, partitions, and network filesystems.

Probably the most useful way to call this command looks like this:

$ iostat -xz 1

This command shows how much data is read from and written to each device. It also shows the average I/O time in milliseconds. The higher this value, the more likely it is that the drive is overloaded with requests or that there is a hardware problem. Which one? Use top to find out whether MySQL (or some other database running on the server) is loading the system. If no such application turns up, there is a chance that something is wrong with the disk.

Another important indicator is the %util column, which shows device utilization, that is, how hard the device is working. Values above 60% indicate poor performance of the disk subsystem. If the value is close to 100%, the disk is running at its limit.
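The 60% rule of thumb can be applied automatically to iostat output. The sketch below assumes that each device table starts with a header row beginning with "Device", ends at a blank line, and has %util as its last column, which matches the sysstat versions I am aware of but should be verified on your system; busy_devices is a hypothetical helper name.

```shell
# Hypothetical helper: read `iostat -x` output on stdin and print the
# devices whose %util (assumed to be the last column) is at or above
# the given threshold (default: 60).
busy_devices() {
  awk -v limit="${1:-60}" '
    NF == 0        { in_table = 0; next }   # a blank line ends a device table
    $1 ~ /^Device/ { in_table = 1; next }   # the header row starts one
    in_table && $NF + 0 >= limit { print $1, $NF }
  '
}

# Usage (three one-second samples):
#   iostat -xz 1 3 | busy_devices 60
```

Tracking the table boundaries matters because the avg-cpu rows also end in a percentage (%idle) that would otherwise be mistaken for a busy device.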

When working with disk testing utilities, pay attention to what exactly you are analyzing.

For example, 100% load on a logical drive that spans multiple physical disks may only mean that the system is constantly processing some I/O operations. What matters is what is happening on the physical disks. So if you are analyzing a logical drive, remember that the disk utilities may not give you useful information.

Step five. Checking logs

Last on our list, but only in order and not in importance: check the logs. They can usually be found in /var/log, in separate folders for different services.

To Linux newcomers, log files may look like a horrible mishmash. They are text files that record what the operating system and applications are doing. There are two kinds of records. The first records what happens in the system or program, for example every transaction or data movement. The second is error messages. Log files can contain both, and they can be simply huge.

The data in log files usually looks rather cryptic, but you still have to make sense of it. DigitalOcean, for example, has a good introduction to this topic.

There are many tools that help you check logs. One is dmesg, which displays kernel messages. There are usually a lot of them, so use the following simple command line to view the last 10 entries:

$ dmesg | tail

Do you want to follow what is happening in real time? I certainly do when I am hunting for problems. To do this, use the tail command with the -f switch. It looks like this:

$ tail -f /var/log/syslog

This command watches the syslog file and prints new events as they are written.

Here’s another convenient command-line script:

$ sudo find /var/log -type f -mtime -1 -exec tail -Fn0 {} +

It follows every log file modified within the last day and prints new entries as they arrive, so fresh problems show up as they happen.
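When a log is too large to read by eye, a rough count of suspicious lines per program shows where to look first. This is only a sketch: errors_by_program is a hypothetical helper name, the keyword list is illustrative, and it assumes the traditional syslog line layout ("Mon DD HH:MM:SS host program[pid]: message"), in which the program name is the fifth field.

```shell
# Hypothetical helper: read traditional-format syslog lines on stdin and
# print, per program, how many lines mention an error-like keyword.
errors_by_program() {
  grep -iE 'error|fail|panic|segfault' |
    awk '{
           prog = $5
           sub(/(\[[0-9]+\])?:$/, "", prog)   # "sshd[812]:" -> "sshd"
           count[prog]++
         }
         END { for (p in count) print count[p], p }' |
    sort -rn
}

# Usage:
#   sudo cat /var/log/syslog | errors_by_program
```

Systems that log with ISO 8601 timestamps put the program name in a different field, so the field number would need adjusting there.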

If your system uses systemd, you will need to use its built-in logging tool, journalctl. Systemd centralizes log management in the journald daemon. Unlike other Linux logs, journald stores data in binary rather than text format.

It is useful to set up journald so that it keeps its logs across reboots. You can do this using this command:

$ sudo mkdir -p /var/log/journal

To enable permanent storage of records, edit the /etc/systemd/journald.conf file so that it includes the following:

[Journal]
Storage=persistent

The most common way to work with these journals is as follows:

$ journalctl -b

It shows all log entries since the last reboot. If the system has been rebooted, you can see what happened before the reboot with this command:

$ journalctl -b -1

This lets you view the log entries from the previous server session.
The journalctl documentation covers many more ways to query the journal.

Logs can be very large and hard to work with. So although you can make sense of them with command-line tools such as grep and awk, it is useful to have dedicated log-viewing software.

For example, I like Graylog, an open source log management system. It collects, indexes, and analyzes a wide variety of information. It uses MongoDB for data storage and Elasticsearch for searching the logs. Graylog makes it easy to monitor server status and, compared with the built-in Linux tools, is easier and more convenient. Among its useful features is the ability to work with many DevOps systems, such as Chef, Puppet, and Ansible.


No matter how you treat your server, it may never make it into the Guinness Book of Records as the longest-running one. But striving to make the server as stable as possible, by getting to the root of problems and fixing them, is a worthy goal. We hope that what we have covered today helps you achieve it.