In discussions of high-scale systems, you often hear the term "fault-tolerant system". See, for example, the description of the Elixir programming language:
Elixir is built on top of the Erlang VM, known for running low-latency, distributed and fault-tolerant systems, while also being successful in web development and embedded systems. [adapted and translated from elixir-lang.org]
But I do not understand what it means for a system to have this property. Does it mean that the system can recover from failures? What kinds of failures must be covered for a system to be considered fault tolerant? And what strategies are used to make a system fault tolerant?
It is a property of systems that can continue operating more or less normally despite faults in any of the parts needed for their operation.
How you achieve this varies, and may involve hardware or software solutions. Tolerance is almost always obtained through some form of standby system, replication, mirroring, or other redundancy, usually combined with monitoring and some automatic action when something goes wrong. One thing that helps is writing robust code, that is, code that is always prepared for a fault to occur and can do something useful when it does. But remember that the tolerance will often come from the infrastructure adopted.
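To make the redundancy idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration (`call_with_failover`, `primary`, `mirror`); real systems usually do this at the infrastructure level, for example with load balancers and health checks, rather than in application code.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant replica in order and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)    # crude "monitoring": record the fault, then move on
    raise AllReplicasFailed(errors)


# Hypothetical replicas: the primary is down, the mirror still works.
def primary(request: str) -> str:
    raise ConnectionError("primary is down")


def mirror(request: str) -> str:
    return f"mirror handled {request}"


result = call_with_failover([primary, mirror], "GET /status")
print(result)
```

The caller never sees the failure of the primary. It only sees an error when every replica has failed, which is exactly the limit of this particular tolerance.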
In software solutions, you often anticipate problems and prevent them from happening, or, once they have happened, you have some mechanism that lets you re-execute the operation or switch to another path that still delivers the desired result. Even a simple system that detects errors in the software and works around them can be considered fault tolerant at some level. Usually the term is only used when everything is resolved without direct human intervention.
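The "re-execute or switch to another path" strategy could look something like the sketch below. It is only an illustration under my own assumptions: `retry_then_fallback` and the example functions are invented names, and the retry count and delay are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_then_fallback(primary: Callable[[], T], fallback: Callable[[], T],
                        attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Re-execute `primary` up to `attempts` times, then switch to `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:  # a real system would only retry errors known to be transient
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before trying again
    return fallback()  # alternative path that still gives a useful result


calls = {"count": 0}


def flaky_fetch() -> str:
    # Hypothetical operation that fails twice and then works.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("temporary failure")
    return "fresh data"


def always_failing_fetch() -> str:
    raise TimeoutError("service unavailable")


def cached_value() -> str:
    return "stale cached data"


recovered = retry_then_fallback(flaky_fetch, cached_value)
degraded = retry_then_fallback(always_failing_fetch, cached_value)
print(recovered, "|", degraded)
```

In the first call the retry alone is enough. In the second the system gives up on the primary path and returns a less fresh but still useful result, with no human involved in either case.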
In general this tolerance is limited, and each solution makes explicit in which cases operation can continue normally. There are always degrees of failure, and the more kinds of failure the system needs to tolerate, the more complex it will be. In some cases tolerance is only possible with heavy replication across different parts of the world. In others it is enough that, if one software component of the solution fails, another takes over the work or at least returns some useful result.
So when the term is used without specifying the level of tolerance, it is often just marketing.
There is no guarantee that tolerance will allow normal operation at all times, only that the system will not stop altogether. In some cases delivering the result is not even the goal; simply not stopping is already a good outcome.
It is important that anything that goes wrong in the middle of a failing operation can be reversed, or at least contained without contaminating other parts.
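One simple way to get that "reverse or contain" behavior is to work on a copy and only commit when every step has succeeded. The sketch below is my own illustration with invented names (`apply_atomically`, the account functions); databases provide the same idea through transactions.

```python
import copy
from typing import Callable, Iterable


def apply_atomically(state: dict, steps: Iterable[Callable[[dict], None]]) -> bool:
    """Run all steps on a copy; commit to `state` only if none of them fails."""
    working_copy = copy.deepcopy(state)
    try:
        for step in steps:
            step(working_copy)
    except Exception:
        return False  # the fault stays contained: `state` was never touched
    state.clear()
    state.update(working_copy)  # commit the result of all steps together
    return True


def debit_alice(accounts: dict) -> None:
    accounts["alice"] -= 30


def credit_bob(accounts: dict) -> None:
    accounts["bob"] += 30


def credit_missing_account(accounts: dict) -> None:
    accounts["carol"] += 30  # raises KeyError: this account does not exist


accounts = {"alice": 100, "bob": 50}

failed = apply_atomically(accounts, [debit_alice, credit_missing_account])
after_failure = dict(accounts)  # the debit from the failed attempt was not kept

succeeded = apply_atomically(accounts, [debit_alice, credit_bob])
print(failed, after_failure, succeeded, accounts)
```

The failed transfer leaves the balances exactly as they were, so the half-finished debit never leaks into the rest of the system.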
Some mechanisms are quite sophisticated, complex, and expensive.
There are no tools that do this magically, as some might wish. Of course, you can pay for a service that gives you something ready-made, but it never comes without great effort from someone.
For all these reasons it is difficult to talk about specific types of failures and strategies: each solution is shaped by the type of system it serves.