SRE Network Engineer
About the job
- Effectively manage troubleshooting and recovery of complex production incidents, ranging from low to critical impact.
- Drive incident resolution through a systematic problem-solving approach, coupled with a strong sense of ownership and drive.
- Actively participate in the team’s Agile stories (project work) to streamline and enhance the day-to-day operations of the team.
- Create, manage, and utilize appropriate technical procedural documentation (runbooks).
- Proactively monitor all applications and infrastructure behind TokenEx’s external and internal customer-facing services, including availability, latency, performance, and capacity.
- Influence resiliency and scalability in production environments in Azure and Amazon Web Services (AWS).
- Assist with conducting Root Cause Analysis (RCA) on critical production outages, and develop and implement mitigation strategies.
- Utilize production support expertise to influence and support new designs, architectures, standards, and methods, maintaining stability and availability for large-scale distributed systems.
- Proactively identify and implement opportunities for automation of routine maintenance tasks, data gathering, and resolution of common issues.
- Continuously seek to develop new skills and technical expertise, as well as proactively share knowledge with others.
- Build software and systems to manage platform infrastructure and applications to improve reliability, quality, and time-to-market of our suite of software solutions.
- Gather and analyze operating system and application metrics to assist in performance tuning and fault finding.
- Participate in system design consulting, platform management, capacity planning, and testing and release procedures.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service level objectives.
- Perform disaster recovery operations, monitor network performance, and troubleshoot, diagnose, and resolve hardware, software, and other network and system problems.
- Bachelor’s degree in Computer Science or equivalent relevant experience (degree preferred but not required)
- 5+ years of software development experience, ideally in an Agile SaaS/product development company
- In-depth understanding of web service protocols and REST API design and consumption
- Excellent .NET (C#) development and debugging skills
- Experience with both container and serverless computing
- Microsoft Azure/AWS developer/architecture certifications preferred
- Skilled in cloud/PaaS environments (e.g., AWS, Azure), LAN, WAN, and network security
- Proficient, collaborative, and experienced in building reliable, scalable enterprise systems
- Ability to identify root-cause sources of instability in high-traffic, large-scale distributed systems
- Linux administration, troubleshooting, and performance tuning experience
- Understanding of observability principles (monitoring, logging, tracing, alerting) and of the tools and practices that promote observability
- Experience with continuous integration tools (e.g., GitLab, AWS CodeBuild, CodeDeploy, CodePipeline, Azure DevOps)
- Troubleshooting skills that span systems, network, and code
- Strong understanding of network infrastructure and network hardware
- Ability to implement, administer, and troubleshoot network infrastructure devices, including firewalls and load balancers
- Configuration management and orchestration (e.g., Terraform, CloudFormation, Ansible, Chef)