Availability

Learn how Sedai helps avoid downtime through proactive remediations.

This feature is best for production resources that experience regular traffic and usage.

Sedai continuously analyzes traffic patterns and performance to build a baseline profile of normal resource behavior. If a resource starts to deviate from its usual state, Sedai's prediction engine generates a remediation to intervene and avoid downtime. Each remediation must pass a series of safety checks that validate Sedai's confidence and its ability to execute safely in production. Completed remediations are assessed to confirm that the changes achieved their intended effect, and the results are fed into the system's ML feedback loop. This feedback lets the decision model keep learning and adapt to production behavior in real time.
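As a rough illustration of the baseline idea described above, the sketch below flags a metric sample that strays far from its recent history using a simple z-score. The function name, the threshold of 3 standard deviations, and the latency values are assumptions made for illustration; this page does not describe the models Sedai's prediction engine actually uses.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, sample, threshold=3.0):
    """Return True if `sample` lies more than `threshold` standard
    deviations away from the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        return False  # too little data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return sample != mu
    return abs(sample - mu) / sigma > threshold

# Illustrative latency samples in milliseconds (made-up data).
latency_ms = [102, 98, 101, 99, 100, 103, 97]

print(deviates_from_baseline(latency_ms, 101))  # sample close to the baseline
print(deviates_from_baseline(latency_ms, 450))  # sample far from the baseline
```

A real system would likely account for seasonality and traffic volume rather than a single static window; the sketch only shows the "compare against normal behavior" step.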

Support

Resource Type                          Issue Detection
Serverless (AWS Lambda)                Throttling, timeouts, OOMs, errors
Containers (AWS ECS/Fargate)           Errors, out of memory/CPU, timeouts
Virtual Machines (AWS EC2, Azure VM)   Errors, out of memory/CPU, timeouts
Kubernetes (Stateless Workloads)       Errors, out of memory/CPU, timeouts

In addition to remediations, Sedai will also generate an alert if it detects unhealthy behavior but is unsure how to intervene.

For each of the resource types mentioned above, you can individually control when a remediation is executed once Sedai detects an issue:

Setting and description

Autopilot: Automatically executes remediation

Sedai proactively detects performance issues, automatically determines the best course of action, and executes changes on your behalf. This mode is designed for teams looking to fully leverage Sedai's capabilities to maximize efficiency and minimize manual intervention. The system acts only after a remediation passes rigorous safety checks, which ensure that no change will negatively impact resource availability or performance. This mode is ideal for organizations that trust the platform's decision-making and want to focus their efforts on innovation rather than routine cloud management tasks.

Copilot: Requires approval to execute

This mode allows Sedai to make recommendations based on its analysis, but requires human approval before any actions are taken. The system identifies an issue and generates a detailed recommendation, then notifies you to review and approve the action for Sedai to execute. If the recommendation is executed, Sedai uses reinforcement learning to train its models for future issue detection and remediation. This mode is suitable for teams that want to maintain oversight and control over cloud operations while still benefiting from Sedai's advanced analysis and automation capabilities. Copilot is ideal for teams that are transitioning to a more autonomous approach but still require a level of human verification.

Datapilot: Get alerted on potential issues

This mode enables Sedai to monitor cloud resources, analyze performance metrics, and flag potential issues without taking any direct action. Sedai still makes recommendations based on its analysis, but takes no action without human approval. The system identifies an issue and generates a detailed recommendation, then notifies you to review it. Datapilot is designed for teams that want to retain full control over cloud operations, using Sedai as a decision-support tool. This mode is particularly useful for organizations that are new to autonomous cloud management or have stringent compliance requirements that necessitate manual intervention.

Note: Recommendations that are not approved within 4 hours of generation expire automatically, because most issues are time-sensitive.

By default, availability is set to Datapilot for all resource types.
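The three settings can be summarized as a small decision function. The sketch below is illustrative only: the `Mode` enum, `next_step`, `is_expired`, and the returned strings are hypothetical names, not part of Sedai's API. Only the Datapilot default and the 4-hour expiry window come from this page.

```python
from datetime import datetime, timedelta
from enum import Enum

class Mode(Enum):
    AUTOPILOT = "autopilot"
    COPILOT = "copilot"
    DATAPILOT = "datapilot"

# Stated on this page: Datapilot is the default for all resource types.
DEFAULT_MODE = Mode.DATAPILOT

# Stated on this page: unapproved recommendations expire after 4 hours.
RECOMMENDATION_TTL = timedelta(hours=4)

def next_step(mode, approved=False):
    """Describe what happens after an issue is detected (hypothetical helper)."""
    if mode is Mode.AUTOPILOT:
        return "submit for execution"
    if mode is Mode.COPILOT:
        return "submit for execution" if approved else "await approval"
    return "notify only"  # Datapilot: flag the issue, take no direct action

def is_expired(created_at, now):
    """True once a recommendation is older than the 4-hour window."""
    return now - created_at >= RECOMMENDATION_TTL

created = datetime(2024, 1, 1, 12, 0)
print(next_step(DEFAULT_MODE))
print(is_expired(created, created + timedelta(hours=5)))
```

In every mode, "submit for execution" still means the remediation has to pass Sedai's safety checks before anything changes.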

Once a remediation is submitted for execution, it must first pass a series of safety checks to ensure Sedai can safely intervene. If the safety checks fail, then the remediation will not be attempted.
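The gate described above amounts to an all-or-nothing check. A minimal sketch, assuming two made-up checks (`confidence_above_threshold` and `resource_is_reachable`) and a made-up remediation dictionary; this page does not list the checks Sedai actually runs:

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[Dict], bool]

# Hypothetical checks, for illustration only.
CHECKS: List[Tuple[str, Check]] = [
    ("confidence_above_threshold", lambda r: r["confidence"] >= 0.9),
    ("resource_is_reachable", lambda r: r["reachable"]),
]

def run_safety_checks(remediation: Dict, checks: List[Tuple[str, Check]]):
    """Return (passed, failed_check_names). The remediation should be
    attempted only when every check passes."""
    failed = [name for name, check in checks if not check(remediation)]
    return len(failed) == 0, failed

passed, failed = run_safety_checks({"confidence": 0.4, "reachable": True}, CHECKS)
if not passed:
    print("Remediation skipped; failed checks:", failed)
```

Returning the names of the failed checks, rather than a bare boolean, makes it easier to explain why a remediation was not attempted.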

You can view a history of remediations and recommendations on the Activity Timeline page. Open recommendations are listed on the Tasks page.

You can optionally configure availability settings using tags.
