In this role, you’ll make an impact in the following ways:
- Provide L2/L3 production support for enterprise applications and ensure platform stability, resiliency, and availability.
- Monitor application health, system performance, batch jobs, interfaces, and alerts using enterprise monitoring and observability tools.
- Investigate, troubleshoot, and resolve production incidents within defined SLAs.
- Perform root cause analysis (RCA) for recurring issues and drive permanent fixes.
- Analyze production logs, identify failure patterns, and create actionable dashboards to improve service monitoring and incident response.
- Coordinate with development, infrastructure, database, network, and business teams for issue resolution.
- Support application deployments, change requests, weekend releases, and post-release validations.
- Maintain incident, problem, and change records in service management tools.
- Drive continuous service improvement through automation, process optimization, and proactive monitoring.
- Participate in on-call support and major incident management as required.
- Prepare operational reports, service health summaries, and stakeholder communications.
- Write and analyze SQL queries for data validation, issue investigation, and production troubleshooting.
- Use Unix/Linux commands and scripting for application support, log reviews, file handling, and system-level troubleshooting.
- Leverage Splunk extensively for log analysis, issue diagnosis, trend identification, alerting insights, and dashboard creation.
To be successful in this role, we’re seeking the following:
- Proven experience in production application support for business-critical applications.
- Strong understanding of incident management, problem management, and change management processes.
- Strong SQL skills for querying, troubleshooting, and data analysis in production environments.
- Extensive hands-on experience with Splunk for log analysis, search creation, troubleshooting, monitoring, and dashboard development.
- Strong Unix/Linux skills for navigating servers, reviewing logs, troubleshooting jobs/processes, and supporting application runtime environments.
- Experience with monitoring and alerting tools such as Grafana, plus log analysis and dashboard-based production support.
- Experience with ITSM tools such as ServiceNow, Jira, or similar platforms.
- Ability to analyze application, infrastructure, and integration issues across distributed systems.
- Experience supporting applications in cloud and/or on-prem environments.
- Familiarity with scripting and troubleshooting middleware/interfaces.
- Strong knowledge of release support, service recovery, and operational governance.
- Ability to work in a high-pressure environment with strong ownership and accountability.
- Demonstrated ability to ramp up quickly on new applications, platforms, and support processes, with strong learning agility and immediate contribution in a fast-paced production environment.
- Azure Cloud experience preferred.
- Knowledge of automation/scripting using Python, Shell, or PowerShell.
- Exposure to DevOps / SRE practices, CI/CD pipelines, and observability tooling.
- Strong communication skills with the ability to provide concise incident and executive status updates.
hackajob is partnering with BNY Mellon to fill this position.