
Senior SRE - GCAB

Mumbai, Maharashtra
Work Type: Full Time
We are hiring for a leading fintech platform company in the B2B lending space, serving SMEs and sectors such as manufacturing and logistics across web and mobile.

We are looking for candidates from product, SaaS, or startup companies.

Job title: Senior SRE
Experience: 5+ Years 
Location: Mumbai (Work from office)


Job Description

The TechOps Engineering team is responsible for the security and integrity of the tech stack and cloud environment, which is paramount to the success of our FinTech platform. The team’s focus is to secure and protect CredAble’s assets, such as customer or payment information, to handle potential data breaches, and to develop tools in partnership with other technical teams. As a Senior Site Reliability Engineer in TechOps, you will work with our teams to maintain, operate, and manage production sites, assess technology risk, monitor security controls, and lead overall resilience across these initiatives.

Best things about the job:
    Working in a highly entrepreneurial setup with a visionary team passionate about helping the business scale new heights of success.
    Exposure to exploring limitless possibilities and ideas, no matter how impossible they may seem today.
    CredAble thrives on transparency and a culture that nurtures growth.
    Being part of CredAble enables you to push beyond the ordinary.


What you'll be doing:

    Identify significant projects that result in substantial improvements in reliability, cost savings and/or revenue.
    Identify changes to the product architecture from the reliability, performance and availability perspectives, using a data-driven approach.
    Influence the product roadmap and work with engineering and product counterparts to improve the resiliency and reliability of the platform.
    Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage.
    Identify parts of the system that do not scale, provide immediate palliative measures and drive long-term resolution of these issues.
    Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
    Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability, and provide support for service engineers with customer incidents.
    Use your on-call shift to prevent incidents from ever happening.
    Run our infrastructure with Chef, Ansible, Terraform, CI/CD and Kubernetes, hosted in a multi-cloud environment: AWS, Azure, and GCP.
    Build monitoring that alerts on symptoms rather than on outages.
    Document every action so your findings turn into repeatable actions and then into automation.
    Improve operational processes (such as deployments and upgrades).
    Design, build and maintain the core infrastructure of the tech platform.
    Debug production issues across services and levels of the stack.
    Help establish security best practices across the technology stack, covering all production environments in a highly distributed, multi-cloud ecosystem.
    Analyze our cloud security posture, identify gaps, and work closely with other teams to ensure strong operational security.
    Take overall accountability for the continuous operational readiness and working status of the technology stack in the production environment.
    Provide ongoing maintenance and improve system health and reliability of cloud services.
    Participate in design and implementation reviews of security and infrastructure projects.
    Lead and mentor junior SREs and security analysts.

Qualifications

    5+ years of experience working on or with Cloud DevOps teams and security teams.
    Expertise in SRE/DevOps tasks, including deploying and maintaining production services.
    Strong knowledge of security topics including network and application security, infrastructure hardening, security baselines, and web server/database security.
    Expertise in modern configuration management tools (e.g., Salt, Ansible, Fabric) and CI/CD systems.
    Expertise in AWS, GCP and/or other cloud environments.
    Excellent technical writing and documentation skills.

1.    Technical:

    Advanced Chef (syntax, recipes, cookbooks) and Ansible (syntax, tasks, playbooks)
    Advanced Terraform syntax and CI/CD configuration, pipelines, jobs
    Advanced knowledge of cloud services
    Kubernetes: cluster provisioning and new services
    Prometheus, Thanos, and Grafana: service catalogue metrics and recording rules for alerts
    Configuration management: use Chef and Ansible to effectively manage our infrastructure
    Infrastructure as code: use Terraform and CI/CD automation, containerize environments (Kubernetes), and leverage cloud technologies to meet our goals
    Systems: manage, configure and troubleshoot operating system issues, storage (block and object), networking (VPCs, proxies and CDNs), and administer high-availability PostgreSQL and Redis clusters
    Monitoring and instrumentation: implement metrics in Prometheus, Grafana, log management and related systems, and Slack/PagerDuty integrations
    Engineering practices: availability, reliability and scalability, as well as disaster recovery
    Work in a variety of languages: Shell, Ruby, Go, Python
    Log shipping pipelines and incident debugging visualizations
    Operating system (Linux) configuration, package management, startup and troubleshooting
    Block and object storage configuration and debugging


2.    Collaboration and Communication:

    Lead initiatives, from problem definition and scoping through design and planning, using epics and blueprints.
    Build deep domain knowledge and share it through recorded demos, technical presentations, discussions, and Incident Reviews.
    Run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.
    For stable counterpart assignments, maintain awareness and actively influence stage group plans and priorities through participation in stage group meetings and async discussions. Act as a champion for reliability.

3.    Influence and Maturity:

    Set an example for a team of SREs with positive and inclusive leadership and discussion on work.
    Show ownership of a major part of the infrastructure.
    Be trusted to de-escalate conflicts within the team.

4.    Nice to have:

    B.S. degree in Computer Science.
    Experience working with container technology including Docker and Kubernetes.
    Relevant Cloud Certifications – AWS / GCP / Azure.
    Knowledge of network-based, system-level, and application-layer attacks and mitigation methods.


You may be a fit for this role if you have some of these inclinations:

    Think about systems: edge cases, failure modes, behaviours, and specific implementations.
    Know your way around Linux and Unix Shell.
    Know what configuration management systems like Chef and Ansible are used for.
    Have strong programming skills: Shell, Python, Ruby and/or Go.
    Have the urge to collaborate and communicate asynchronously.
    Have the urge to document all the things so you don't need to learn the same thing twice.
    Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
    Like to mentor the team and share knowledge.
    Maintain self-awareness and can handle conflict in the team by giving and receiving feedback.
    Maintain good relationships with other engineering teams.
    Have the urge to deliver quickly and effectively and to iterate fast.
    Share our values, and work in accordance with those values.
    Have experience with Nginx, Docker, Kubernetes, Terraform, or similar technologies.
