Technology | Full-time
Position Overview
Job Title: Site Reliability Engineer (SRE)
Department: Technology
Location: Bangalore
Reporting To: Head of Infra
Tookitaki is looking for a Site Reliability Engineer (SRE) with 3–6 years of experience to help maintain and scale the infrastructure that powers our flagship products—FinCense and the AFC Ecosystem. As an SRE, you will work at the intersection of software engineering and infrastructure, ensuring high availability, performance, and scalability of our platforms.
You will collaborate with engineering, DevOps, and client success teams to operationalize deployments across on-premise, VPC, and Compliance as a Service (CaaS) environments while improving monitoring, automation, and incident response.
Position Purpose
The SRE role is responsible for ensuring the reliability and efficiency of Tookitaki’s production systems and environments. This includes building monitoring systems, improving deployment pipelines, automating routine operations, and responding to production incidents. You’ll help build a resilient infrastructure that supports our mission to provide AI-driven solutions that prevent financial crime.
Key Responsibilities
- System Monitoring & Incident Management
  - Build and maintain monitoring, alerting, and logging systems using tools like Prometheus, Grafana, and ELK.
  - Respond to incidents and outages, conduct post-mortems, and implement corrective actions.
- Infrastructure & Deployment Automation
  - Automate infrastructure provisioning and application deployment using Terraform, Ansible, or Helm.
  - Contribute to CI/CD pipelines (GitLab CI, Jenkins, etc.) to improve the reliability and speed of software delivery.
- Container & Orchestration Management
  - Manage and troubleshoot Docker containers and Kubernetes clusters, ensuring workload scaling, resource management, and health.
  - Support application updates, rollbacks, and blue-green or canary deployments.
- Cloud & Platform Operations
  - Operate within AWS (preferred; EC2, S3, VPC, IAM) or GCP environments.
  - Monitor system availability and resource usage across environments.
- Security & Reliability Enhancements
  - Implement and monitor TLS/SSL, RBAC, SSO, and secure API practices.
  - Support compliance and security audit activities by maintaining logs, access controls, and operational hygiene.
- Collaboration & Documentation
  - Work closely with developers, infra engineers, and support teams to ensure production readiness.
  - Maintain playbooks, runbooks, and system documentation for reliability engineering activities.
Qualifications and Skills
Education
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
Experience
- 3–6 years in Site Reliability Engineering, DevOps, Platform Engineering, or a related role.
- Experience with production environments and live system debugging.
Technical Skills
- Kubernetes, Docker, and Helm: experience deploying and scaling services.
- Linux administration and command-line debugging.
- Hands-on experience with AWS (preferred) or GCP.
- Scripting in Bash and Python for automation and monitoring tasks.
- Experience with monitoring and alerting tools like Prometheus, Grafana, ELK, or Datadog.
- Familiarity with databases (e.g., MariaDB, ScyllaDB) and SQL/CQL querying.
Soft Skills
- Strong problem-solving and debugging skills.
- Ability to work in on-call rotations and high-pressure production environments.
- Excellent communication and documentation abilities.
Key Competencies
- Operational Reliability: Ensures system uptime and performance through proactive monitoring and maintenance.
- Automation Mindset: Reduces manual effort through scripting and tooling.
- Incident Response: Quickly identifies and resolves issues to minimize downtime.
- Cross-Functional Collaboration: Works effectively with engineering, support, and infra teams.
- Security Awareness: Applies best practices in infrastructure and platform security.
Success Metrics
- Maintain 99.9%+ uptime across production environments.
- Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for critical incidents.
- Increase automation coverage and reduce manual deployment steps.
- Maintain high developer satisfaction with CI/CD and platform reliability.
- Ensure compliance readiness and security log availability for audits.
Benefits
- Competitive compensation
- Work on a globally recognized RegTech platform transforming financial crime prevention.
- Exposure to cutting-edge AI and big data infrastructure (Spark, Kafka, ScyllaDB, Flink).