Are you passionate about AWS operations and SaaS? Do you like learning new technologies and solving problems at scale? Do you thrive in a fast-paced, competitive, team-oriented environment? SuccessKPI is seeking an AWS Cloud Operating Engineer to work in our Chantilly, VA office and remotely from India or Poland.
This position will require you to work as part of a team operating and scaling multiple AWS deployments worldwide. We are looking for team members who can think critically, act independently and operate with a data-driven mindset while remaining focused on our large enterprise and government clients. Do you have what it takes?
- Bachelor's degree or equivalent preferred
- Area of Study: Computer Science or IS/IT preferred
- 6-8 years of related experience in production support
- 2-3 years of related hands-on experience with AWS
- Broad knowledge of the AWS platform with AWS Certifications required
- Knowledge of underlying AWS services – including but not limited to: AMIs, Route53, VPC, EC2, S3, IAM, AWS CLI, EBS, ELB, SQS, CloudWatch, CloudTrail, AWS Lambda, DynamoDB and other similar tools
- Experience with Docker/Kubernetes and container orchestration
- Hands-on experience in AWS provisioning of systems with security and reliability in mind (e.g. securing VPCs and subnets, implementing security groups, managing identity and access, scheduling backups, planning for and managing restoration and disaster recovery)
- Managing uptime and performance including system health monitoring and optimizing performance (e.g. using CloudWatch and related tools)
- Administration of web servers running Apache and Tomcat
- Network experience including management of DNS, certificate management, load balancing, firewall configuration and routing. Broad experience with software-defined and traditional networking
- Strong understanding of Linux, including experience with server administration, monitoring, and troubleshooting
- Broad experience with IaaS and PaaS
- Experience building cloud infrastructure using infrastructure-as-code tools like AWS CloudFormation or Terraform
- Previous operations experience in cloud environments at scale
Your future duties and responsibilities
- Provide operations support for office or business unit users of proprietary or custom application software in a 24/7/365 environment supporting Cloud Operations.
- Work schedules may vary, including some non-traditional business hours, to support large-scale cloud platforms that run mission-critical applications
- Take point on “end to end” support and smooth operations of cloud-based infrastructure, and support change windows, incident response and resolution, and other scheduled maintenance activities.
- Follow Incident Management, Change Management and Root Cause Analysis processes
- Gain business and application knowledge through training and resolving production operations incidents and inquiries
- Incident Management
- Triage and resolve Production incidents related to the cloud platform and participate in root cause analysis and post mortem discussions.
- Analyze cloud platform related Production incidents and engage business team(s) to determine the impact of each incident
- Work with application support members and cloud support vendors to identify “work-arounds” to situations where permanent solutions cannot be applied in a timely manner
- Provide a collaborative conduit between application/support teams and cloud vendor support teams such as AWS, GCP and Azure
- Escalate to team leads in a timely manner when resolution cannot be achieved
- Recreate and test possible solutions and/or workarounds in lower environments prior to implementing in Production
- Work closely with Cloud Engineering team and other support staff to identify and resolve incidents and create and implement long term remediation techniques and fixes
- Identify and document known issues and work with Cloud Engineering partners and vendor support to address recurrence and the identified workaround activity
- Operations, Monitoring, and Capacity Planning
- Manage cloud operations and infrastructure: rehydration activities, IAM, security and compliance, availability, data protection, authentication and authorization, capacity and resource management, service metering and operational cost oversight, disaster recovery and mitigation.
- Create processes to measure system effectiveness and identify areas for improvement
- Create processes intended to provide environment security, as well as automated processes to provide information on current specifications.
- Stay abreast of new technologies in the field and provide recommendations to organizational management on new solutions
- Identify, correct, and enhance important software tools; seek ways to enhance systems operations, with a focus on automation and minimizing cost
- Build effective monitoring, alerts, and metrics for production services
- Plan for adequate capacity of systems based on utilization metrics and planned projects to establish supply and demand forecasts
- Change Management
- Work closely with internal team members and other stakeholders to review proposed changes and help devise post implementation verification routines and system health checks
- Assist in testing changes in lower environments to ensure the solution behaves as intended
- Create and review operational change tickets with senior team members when changes to Production are needed, ensuring they are complete, clear and concise
- Review operational change tickets with senior team members after they are submitted by other teams to make sure they are complete, clear and concise and meet all requirements of the change standard
- Communicate throughout change management activities
- Coordinate emergency changes per standard
- Compliance and Security
- Provide assistance in maintaining compliance with password resets, access reviews and remediation of operational incidents
- Assist in documenting remediation steps for operational incidents
- Engage with management, risk and compliance teams as needed
- Assist with compliance and security audit requests
- Virtual Work opportunity
- Opportunity to work in a fast-growing company
- Medical, Dental and Vision Insurance
- Stock Option Plan
- Open Time Off
- Six weeks paid maternity leave
- 11 paid Company holidays
- 401K Savings Plan – a retirement planning vehicle that provides you the opportunity to benefit from pre-tax savings and offers an employer match of 100% up to 1% and 50% up to 6% with an annual cap.
- Basic Life Insurance provided at no cost to you
- Accidental Death & Dismemberment Insurance coverage in case of accidental death or terminal illness.
- Long Term Disability Insurance can replace part of your income if a disability keeps you out of work for a long period of time.
- Short Term Disability Insurance pays you a weekly benefit if you have a covered disability that keeps you from working.
- Work-life balance EAP – Work-Life Balance Employee Assistance Program provides insureds and their dependents with confidential, experienced assistance in dealing with day-to-day life issues or crisis support. The Work-Life Balance program helps keep employees productive at work by helping them deal effectively with personal or professional goals and challenges.
- Development and career growth opportunities
About SuccessKPI Inc.:
SuccessKPI is a rapidly growing and thriving business providing a pure SaaS analytics platform for contact centers. SuccessKPI combines a rich data lake and business intelligence layer with quality management, speech and text analytics and the real time action power of playbooks to act on customer conversations. Customers can get started in minutes with this robust and actionable customer experience management platform.
For more information on SuccessKPI, please visit us at successkpi.com
SuccessKPI is an Equal Opportunity Employer – M / W / D / V / GI / SO / A