Site Reliability Engineer at Tookitaki Holding PTE LTD

See all the jobs at Tookitaki Holding PTE LTD here: http://tookitaki78808.recruiterbox.com/jobs

Site Reliability Engineer

Bangalore | Technology | Full-time

Position Overview

Job Title: Site Reliability Engineer (SRE)
Department: Technology
Location: Bangalore
Reporting To: Head of Infra

Tookitaki is looking for a Site Reliability Engineer (SRE) with 3–6 years of experience to help maintain and scale the infrastructure that powers our flagship products—FinCense and the AFC Ecosystem. As an SRE, you will work at the intersection of software engineering and infrastructure, ensuring high availability, performance, and scalability of our platforms.

You will collaborate with engineering, DevOps, and client success teams to operationalize deployments across on-premise, VPC, and Compliance as a Service (CaaS) environments while improving monitoring, automation, and incident response.

Position Purpose

The SRE role is responsible for ensuring the reliability and efficiency of Tookitaki’s production systems and environments. This includes building monitoring systems, improving deployment pipelines, automating routine operations, and responding to production incidents. You’ll help build a resilient infrastructure that supports our mission to provide AI-driven solutions that prevent financial crime.

Key Responsibilities

System Monitoring & Incident Management

Build and maintain monitoring, alerting, and logging systems using tools like Prometheus, Grafana, and ELK.
Respond to incidents and outages, conduct post-mortems, and implement corrective actions.

Infrastructure & Deployment Automation

Automate infrastructure provisioning and application deployment using Terraform, Ansible, or Helm.
Contribute to CI/CD pipelines (GitLab CI, Jenkins, etc.) to improve the reliability and speed of software delivery.

Container & Orchestration Management

Manage and troubleshoot Docker containers and Kubernetes clusters, covering workload scaling, resource management, and cluster health.
Support application updates, rollbacks, and blue-green or canary deployments.

Cloud & Platform Operations

Operate within AWS (preferred; EC2, S3, VPC, IAM) or GCP environments.
Monitor system availability and resource usage across environments.

Security & Reliability Enhancements

Implement and monitor TLS/SSL, RBAC, SSO, and secure API practices.
Support compliance and security audit activities by maintaining logs, access controls, and operational hygiene.

Collaboration & Documentation

Work closely with developers, infra engineers, and support teams to ensure production readiness.
Maintain playbooks, runbooks, and system documentation for reliability engineering activities.

Qualifications and Skills

Education

Bachelor’s degree in Computer Science, Engineering, or a related technical field.

Experience

3–6 years in Site Reliability Engineering, DevOps, Platform Engineering, or a related role.
Experience with production environments and live system debugging.

Technical Skills

Kubernetes, Docker, Helm – experience deploying and scaling services.
Linux administration and command-line debugging.
Hands-on experience with AWS (preferred) or GCP.
Scripting in Bash and Python for automation and monitoring tasks.
Experience with monitoring and alerting tools like Prometheus, Grafana, ELK, or Datadog.
Familiarity with databases (e.g., MariaDB, ScyllaDB) and SQL/CQL querying.

Soft Skills

Strong problem-solving and debugging skills.
Ability to work in on-call rotations and high-pressure production environments.
Excellent communication and documentation abilities.

Key Competencies

Operational Reliability: Ensures system uptime and performance through proactive monitoring and maintenance.
Automation Mindset: Reduces manual effort through scripting and tooling.
Incident Response: Identifies and resolves issues quickly to minimize downtime.
Cross-Functional Collaboration: Works effectively with engineering, support, and infra teams.
Security Awareness: Applies best practices in infrastructure and platform security.

Success Metrics

Maintain 99.9%+ uptime across production environments.
Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for critical incidents.
Increase automation coverage and reduce manual deployment steps.
Maintain high developer satisfaction with CI/CD and platform reliability.
Ensure compliance readiness and security log availability for audits.

Benefits

Competitive compensation.
Work on a globally recognized RegTech platform transforming financial crime prevention.
Exposure to cutting-edge AI and big data infrastructure (Spark, Kafka, ScyllaDB, Flink).

