Redefining Software Resilience: The Era of Artificial Immune Systems

Charith De Silva
6 min read · Aug 18, 2023
© Charith De Silva

The cost of software failures keeps rising every year. According to the Consortium for Information & Software Quality¹, operational software failures cost $1.56T in the US alone in 2022. To put that in everyday vernacular, 2.6 million Batwings!

It’s no joke that software failures cost billions of dollars a year. They stem from a range of causes, from human error to oversights during development and weak quality gates. Robotic Process Automation is one technology used extensively to automate repetitive tasks, reducing human-induced failures. Even so, it does not eliminate the possibility of operational issues, which range from performance problems to usability and security problems. Rectifying operational issues requires specialized human intervention, or does it?

Self-healing software isn’t a new or uncommon concept. Software that restarts itself when it encounters an error is a rudimentary example of a self-healing system. With the evolution of artificial intelligence, however, self-healing systems can expand their capabilities further. Artificial intelligence techniques, such as machine learning and predictive analytics, help to identify patterns, predict potential failures, and suggest preventive measures.

Self-healing software revolves around three concepts: error detection, isolation, and recovery. However, this model has inherent limitations. Because self-healing systems rely on predefined error models, unseen or unexpected errors that don’t match the existing models might go undetected, leading to system instability. In addition, these methodologies typically need human intervention to define error models, design recovery procedures, or handle errors that the system cannot rectify autonomously. Present self-healing methods also focus on reactive rather than proactive healing. Although a reactive approach fixes the system after the fact, it might not always prevent system instability.
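To make the detect, isolate, recover cycle concrete, here is a minimal Python sketch of a rule-based healer. Everything in it is an illustrative assumption rather than a real tool: the ErrorModel class, the log patterns, and the restart action are hypothetical. It also shows the limitation described above, since an error that matches no predefined model passes through undetected.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class ErrorModel:
    """A predefined error signature paired with a recovery procedure."""
    name: str
    matches: Callable[[str], bool]  # detection rule applied to one log line
    recover: Callable[[str], str]   # recovery action for the affected component


def restart(component: str) -> str:
    # Hypothetical recovery action; a real system would call its orchestrator.
    return f"restarted {component}"


# Hypothetical, hand-written error models.
ERROR_MODELS: List[ErrorModel] = [
    ErrorModel("out_of_memory", lambda line: "OutOfMemoryError" in line, restart),
    ErrorModel("db_timeout", lambda line: "connection timed out" in line, restart),
]


def heal(component: str, log_line: str, quarantined: Set[str]) -> Optional[str]:
    """Run one detect -> isolate -> recover pass for a single log line."""
    for model in ERROR_MODELS:
        if model.matches(log_line):            # 1. detection
            quarantined.add(component)         # 2. isolation
            result = model.recover(component)  # 3. recovery
            quarantined.discard(component)     # back in service
            return f"{model.name}: {result}"
    return None  # no model matched, so the error goes undetected


quarantined: Set[str] = set()
print(heal("backend", "java.lang.OutOfMemoryError: heap space", quarantined))
print(heal("backend", "segfault in native module", quarantined))
```

The first call matches a known model and triggers a restart; the second returns None because no model describes that failure.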

Bayesian Networks
Bayesian networks are probabilistic models representing the conditional dependencies among a set of variables. They can be a great way to build error models for self-healing software. Let’s picture a simple software system that consists of a client-serving application. The application depends on the application server and a backend service. The backend service relies on a database and an external service exposed via an API. The connectivity between the backend service, the API, and the database depends on the network. If we represent this scenario as a Directed Acyclic Graph (DAG), the structure a Bayesian network requires, a self-healing system can detect issues based on the graph and provide a resolution.

Based on this diagram, slowness in the backend service can result from either the database or the API.
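As a rough illustration of how such a graph can rank likely causes, the Python sketch below encodes a four-node network (network, database, API, backend) and computes posteriors by brute-force enumeration. All probabilities are invented for the example, and the variable names are assumptions; a production system would learn these tables from incident data and would likely use a dedicated inference library.

```python
from itertools import product

# Illustrative, made-up probabilities (assumptions, not measured values).
P_NET_DOWN = 0.05
P_DB_SLOW = {True: 0.90, False: 0.10}    # P(db_slow | net_down)
P_API_SLOW = {True: 0.90, False: 0.20}   # P(api_slow | net_down)
P_BACKEND_SLOW = {                       # P(backend_slow | db_slow, api_slow)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}

VARIABLES = ("net_down", "db_slow", "api_slow", "backend_slow")


def bernoulli(p_true, value):
    """Probability of a boolean value given P(value is True)."""
    return p_true if value else 1.0 - p_true


def joint(net_down, db_slow, api_slow, backend_slow):
    """Joint probability, factorised along the edges of the DAG."""
    return (
        bernoulli(P_NET_DOWN, net_down)
        * bernoulli(P_DB_SLOW[net_down], db_slow)
        * bernoulli(P_API_SLOW[net_down], api_slow)
        * bernoulli(P_BACKEND_SLOW[(db_slow, api_slow)], backend_slow)
    )


def posterior(query, evidence):
    """P(query=True | evidence), summing the joint over all assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # inconsistent with what was observed
        p = joint(**world)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


observed = {"backend_slow": True}
for cause in ("db_slow", "api_slow", "net_down"):
    print(cause, round(posterior(cause, observed), 3))
```

Enumeration is exponential in the number of variables, so it only suits toy graphs like this one, but it keeps the link between the graph and the diagnosis easy to follow.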

Artificial Immune Systems (AIS)
Artificial Immune Systems, inspired by the vertebrate immune system, provide an innovative approach to designing self-healing software. By emulating the biological immune system’s ability to adapt, learn, and remember, AIS can empower software systems to detect, diagnose, and fix issues autonomously. AIS offers a framework that enables the software to learn from each interaction, adapt to system changes, and remember past faults and their resolutions. AIS leads to a more robust, resilient system capable of tackling an array of unpredictable errors and vulnerabilities.

The vertebrate immune system consists of innate immunity and adaptive immunity. Innate immunity protects us against known pathogens and is always non-specific and general. Present self-healing software models closely resemble innate immunity. Adaptive immunity can learn from current threats and apply that knowledge to handle future situations. At their core, artificial immune systems mimic the vertebrate immune system’s differentiation of self and non-self entities. The standard operating conditions are considered self, and anything that does not conform to self becomes non-self.

AIS works with three artificial cell types: detectors, memory cells, and antibodies. Detectors are the software equivalent of immune cells, which incessantly monitor the system. They recognize patterns and pinpoint irregularities, such as software bugs, errors, or security vulnerabilities. Memory cells are the system’s record-keeping mechanism, storing information about past ‘infections’ and responses. Antibodies are the corrective responses created by the system when a detector identifies a problem. Together, detectors, memory cells, and antibodies help AIS achieve the three main characteristics of the adaptive immune system: adaptation, learning, and memory.
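One common way to build detectors in the AIS literature is negative selection: generate candidate detectors at random and discard any that match self. The Python sketch below is a minimal version under stated assumptions. System state is reduced to a hypothetical pair of normalised metrics (CPU load, latency), and the radius, sample values, and function names are all invented for illustration. It covers detectors only, not memory cells or antibodies.

```python
import math
import random


def train_detectors(self_samples, count, radius, rng):
    """Negative selection: keep only random candidates that match no self sample."""
    detectors = []
    while len(detectors) < count:
        candidate = (rng.random(), rng.random())
        # A candidate that lies too close to normal behaviour is discarded.
        if all(math.dist(candidate, s) > radius for s in self_samples):
            detectors.append(candidate)
    return detectors


def is_non_self(sample, detectors, radius):
    """A sample is flagged when it falls within the radius of any detector."""
    return any(math.dist(sample, d) <= radius for d in detectors)


# Hypothetical normal operating points: (cpu_load, latency), both scaled to 0..1.
self_samples = [(0.30, 0.20), (0.35, 0.25), (0.28, 0.22), (0.32, 0.18)]
RADIUS = 0.15

rng = random.Random(42)  # seeded so repeated runs give the same detectors
detectors = train_detectors(self_samples, count=200, radius=RADIUS, rng=rng)

print("normal point flagged:", is_non_self((0.30, 0.20), detectors, RADIUS))
print("unusual point flagged:", is_non_self((0.90, 0.90), detectors, RADIUS))
```

By construction no detector lies within the radius of a training sample, so the recorded self states are never flagged. Whether an unusual point is caught depends on how densely the random detectors cover the non-self space.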

Although the diagram above is a deliberate simplification, it captures, at a high level, the components needed to build an AIS. The application should have traceability built in. The Detector module continuously monitors application logs and operational metrics and refers any anomalies it detects to the Decision Maker. The Decision Maker proposes a potential solution based on the available patterns, which is then applied to the application. After verifying the status of the application, the Verifier reports the effectiveness of the resolution to the Decision Maker, which uses this feedback for learning.
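The loop between the Detector, the Decision Maker, and the Verifier can be sketched in a few lines of Python. This is a toy under loud assumptions: the latency threshold, the remedy names, and the simulated application (where only a restart fixes the problem) are all hypothetical, and the learning is a simple success counter rather than a real model.

```python
from collections import defaultdict


class DecisionMaker:
    """Picks the remedy with the best track record for a given anomaly."""

    def __init__(self, remedies):
        self.remedies = remedies       # anomaly name -> candidate remedy names
        self.score = defaultdict(int)  # (anomaly, remedy) -> successes minus failures

    def propose(self, anomaly):
        candidates = self.remedies[anomaly]
        return max(candidates, key=lambda r: self.score[(anomaly, r)])

    def learn(self, anomaly, remedy, worked):
        self.score[(anomaly, remedy)] += 1 if worked else -1


def detect(metrics):
    """Detector: flag an anomaly from operational metrics (threshold is assumed)."""
    return "high_latency" if metrics["latency_ms"] > 500 else None


def apply_remedy(remedy, metrics):
    """Simulated application: in this toy, only a restart fixes the latency."""
    if remedy == "restart_service":
        metrics["latency_ms"] = 120


def heal(decision_maker, metrics, max_attempts=3):
    """Detect, decide, apply, verify, and feed the result back for learning."""
    attempts = []
    for _ in range(max_attempts):
        anomaly = detect(metrics)
        if anomaly is None:
            break
        remedy = decision_maker.propose(anomaly)
        apply_remedy(remedy, metrics)
        worked = detect(metrics) is None  # Verifier: is the anomaly gone?
        decision_maker.learn(anomaly, remedy, worked)
        attempts.append(remedy)
    return attempts


dm = DecisionMaker({"high_latency": ["clear_cache", "restart_service"]})
print(heal(dm, {"latency_ms": 900}))  # first incident: trial and error
print(heal(dm, {"latency_ms": 900}))  # second incident: uses what was learned
```

On the first incident the loop tries the ineffective remedy before finding the one that works. On the second, the stored feedback lets it go straight to the successful remedy.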

Predictive Analytics and Machine Learning
Predictive analytics uses historical and current data to forecast future events, trends, or behaviors. With sufficient historical failure data, predictive models can predict the likelihood of software failures. Machine learning techniques can identify patterns of events that typically lead to instability or crashes. When these patterns begin to emerge in real-time data, the software can take preventative action to stop the failure from occurring. For example, if the software predicts an imminent memory overflow, it might automatically free up resources or restart imperiled services to prevent a crash.
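As a small illustration of the memory example, the Python sketch below fits a straight line to recent memory readings and estimates how long until a limit is reached. It assumes one sample per minute and linear growth, which real workloads rarely follow exactly; the function names, the 30-minute action threshold, and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until_limit(samples_mb, limit_mb):
    """Estimate minutes from the latest sample until memory hits the limit.

    Assumes one sample per minute. Returns None when usage is not growing.
    """
    xs = list(range(len(samples_mb)))
    slope, intercept = fit_line(xs, samples_mb)
    if slope <= 0:
        return None
    minute_of_limit = (limit_mb - intercept) / slope
    return max(0.0, minute_of_limit - xs[-1])


def plan_action(samples_mb, limit_mb, act_within_minutes=30):
    """Choose a preventative action; the threshold is an assumed policy."""
    remaining = minutes_until_limit(samples_mb, limit_mb)
    if remaining is not None and remaining <= act_within_minutes:
        return "free_resources_or_restart"
    return "no_action"


usage = [500, 520, 540, 560, 580]  # MB, one reading per minute
print(minutes_until_limit(usage, limit_mb=1000))
print(plan_action(usage, limit_mb=1000))
```

A linear trend is the simplest possible predictor. The same structure works with a richer model, provided it can turn historical readings into an estimated time to failure.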
By learning normal system behavior (“self”), predictive models can help identify anomalies or unusual activity (“non-self”) that might indicate a problem. System anomalies could include log entries signifying errors, CPU/Memory spikes, abnormal network traffic patterns, or suspicious user behavior indicating a security breach.
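A very simple way to learn "self" for a single metric is a statistical baseline. The Python sketch below learns the mean and standard deviation of CPU readings taken during normal operation and flags values more than three standard deviations away. The Baseline class, the threshold of 3, and the sample readings are assumptions for illustration; real systems usually combine many metrics and account for daily or weekly patterns.

```python
import statistics


class Baseline:
    """Learns normal behaviour ("self") for one metric from historical values."""

    def __init__(self, normal_values, threshold=3.0):
        self.mean = statistics.mean(normal_values)
        self.stdev = statistics.stdev(normal_values)
        self.threshold = threshold  # allowed distance, in standard deviations

    def is_anomalous(self, value):
        """Flag values that fall outside the learned band ("non-self")."""
        return abs(value - self.mean) > self.threshold * self.stdev


# Hypothetical CPU utilisation (%) observed during normal operation.
cpu_history = [41, 39, 40, 42, 38, 40, 41, 39]
cpu_baseline = Baseline(cpu_history)

for reading in (41, 95):
    label = "non-self" if cpu_baseline.is_anomalous(reading) else "self"
    print(reading, label)
```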

Cybersecurity
The uses of AIS go beyond self-healing software systems. AIS concepts can detect and prevent security vulnerabilities as well. By learning the self-state of user interactions, AIS can call out any user actions that do not conform to self, and based on the available knowledge, it can take preventive measures.

Augmented SRE
The advancement of self-healing software, and eventually of autonomous self-healing software, will significantly impact Site Reliability Engineering (SRE). We are looking at an Augmented SRE in the near to long term. Until self-healing can be fully autonomous, SREs will have to stay involved to maintain system stability. But rather than SREs actively monitoring the system, self-healing software will notify the SRE when it needs assistance. The SRE can then converse with the system to identify the problem and help it heal itself.

Security & Ethics
As with any autonomous technology, AIS raises several ethical and security considerations. For instance, there are concerns about the potential misuse of such technology in malware, where a malicious program could learn from its environment and adapt to avoid detection. Also, the autonomous nature of these systems could lead to actions that are hard to predict or control, raising concerns about accountability and transparency. Another potential challenge is that a malicious entity, equipped with enough understanding of the system, could deceive it into ignoring incidents as false positives.

Self-healing software and AIS have improved vastly in the last few years. With large language models (LLMs) showing the ability to write code and fix bugs, fully autonomous software might be closer than we think. Although self-healing software will increase system stability and security, it might not reduce operational costs immediately. As magnificent as these systems are, they are neither easy nor cheap to build. The benefit of autonomous self-healing software will initially come as an improvement in user experience; recovering the initial investment will be a long-term objective.
