Cloud Operations & Monitoring

Bengaluru, Karnataka, India | Technology | Full-time

Job Description: Cloud Ops & Monitoring Engineer

Job Title: Cloud Ops & Monitoring Engineer
Location: Bengaluru
Department: Technology
Reporting To: Cloud Infra Director

Position Overview

Tookitaki is seeking a Cloud Ops & Monitoring Engineer to ensure the stability, performance, and security of our cloud-based infrastructure across all product offerings. This role is crucial to maintaining high availability, optimizing cloud operations, and proactively monitoring our cloud environments. The ideal candidate will bring deep expertise in cloud platforms, automation, and observability tools, and will use it to drive incident response, cost optimization, and operational efficiency.

Position Purpose

The Cloud Ops & Monitoring Engineer is responsible for monitoring, optimizing, and maintaining Tookitaki’s cloud infrastructure. This role ensures high system reliability, proactive incident management, and efficient resource utilization. By leveraging automation and advanced monitoring tools, the engineer will drive operational excellence, minimize downtime, and enhance cloud security.

Key Responsibilities

Cloud Operations Management

  • Monitor and manage cloud infrastructure (AWS, GCP, Azure) for performance, availability, and security.

  • Ensure 99.99% uptime of mission-critical systems through proactive maintenance and incident resolution.

  • Implement best practices for cloud governance, cost optimization, and capacity planning.

Monitoring & Incident Response

  • Set up and maintain observability tools (Prometheus, Grafana, ELK stack, Datadog, New Relic).

  • Develop real-time monitoring and alerting mechanisms to detect anomalies before they impact operations.

  • Act as the first responder for production incidents, ensuring swift issue resolution and root cause analysis.

Automation & Infrastructure Optimization

  • Develop and maintain Infrastructure as Code (IaC) scripts (Terraform, CloudFormation) for cloud automation.

  • Automate cloud scaling, log management, and incident resolution workflows.

  • Optimize cloud environments for performance, security, and cost efficiency.

Security & Compliance Enforcement

  • Implement security best practices, including IAM policies, encryption, and vulnerability management.

  • Work closely with security teams to detect and mitigate threats in cloud environments.

  • Ensure compliance with applicable data protection, payment security, and audit standards (GDPR, PCI DSS, SOC 2).

Cross-Team Collaboration & Reporting

  • Collaborate with DevOps, Security, and Development teams to enhance cloud performance.

  • Provide operational insights and reports on cloud system health, trends, and optimization opportunities.

  • Write incident reports and document troubleshooting steps and operational playbooks for continuous learning.

Key OKRs

  • Maintain 99.99% system uptime by proactively monitoring and resolving cloud incidents.

  • Reduce cloud operational costs by 20% through optimization and automation.

  • Automate 80% of cloud monitoring and alerting processes within six months.

  • Ensure 100% compliance with cloud security policies and regulatory standards.

  • Reduce MTTR (Mean Time to Resolution) for critical incidents by 30%.

Qualifications and Skills

Education

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.

  • Certifications in AWS, Azure, Google Cloud, or Kubernetes (preferred).

Experience

  • 5+ years of experience in cloud operations, monitoring, or DevOps roles.

  • Proven experience in managing highly available, production-grade cloud environments.

Technical Expertise

  • Proficiency in AWS, GCP, or Azure cloud services.

  • Strong hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic).

  • Expertise in Infrastructure as Code (IaC) tools (Terraform, CloudFormation).

  • Experience with containerization and orchestration (Docker, Kubernetes).

  • Knowledge of cloud security, IAM policies, encryption, and threat detection.

  • Familiarity with CI/CD pipelines, scripting (Python, Bash), and cloud automation.

Soft Skills

  • Analytical mindset with strong troubleshooting and problem-solving abilities.

  • Excellent communication skills to work cross-functionally with multiple teams.

  • Proactive and detail-oriented, with a focus on continuous improvement.

  • Ability to work in a fast-paced, dynamic environment with tight deadlines.

Key Competencies

  • Cloud Monitoring & Performance Optimization: Ensures system health and efficiency through real-time observability.

  • Incident Management & Troubleshooting: Rapidly diagnoses and resolves production issues with minimal downtime.

  • Automation & Infrastructure Management: Implements self-healing and scalable cloud solutions.

  • Security & Compliance Awareness: Ensures adherence to regulatory standards and cloud security best practices.

  • Cross-Functional Collaboration: Works closely with engineering, security, and DevOps teams to enhance cloud operations.

Success Metrics

  • Maintain 99.99% system uptime, ensuring minimal service disruption.

  • Reduce MTTR (Mean Time to Resolution) for critical incidents by 30%.

  • Automate 80% of cloud monitoring and incident response workflows.

  • Optimize cloud resource utilization, achieving a 20% cost reduction.

  • Implement a fully operational cloud observability framework within six months.

Benefits

  • Competitive Salary: Aligned with industry standards and experience.

  • Professional Development: Access to training in big data, cloud computing, and data integration tools.

  • Comprehensive Benefits: Health insurance and flexible working options.

  • Growth Opportunities: Career progression within Tookitaki’s rapidly expanding Services Delivery team.