The Onyx Research Data Platform organization represents a major investment by GSK R&D and Digital & Tech, designed to deliver a step change in our ability to leverage data, knowledge, and prediction to find new medicines. We are a full-stack shop consisting of product and portfolio leadership, data engineering, infrastructure and DevOps, data / metadata / knowledge platforms, and AI/ML and analysis platforms, all geared toward:
- Building a next-generation data experience for GSK’s scientists, engineers, and decision-makers, increasing productivity, and reducing time spent on “data mechanics”
- Providing best-in-class AI/ML and data analysis environments to accelerate our predictive capabilities and attract top-tier talent
- Aggressively engineering our data at scale to unlock the value of our combined data assets and predictions in real time
Data Engineering is responsible for the design, delivery, support, and maintenance of industrialised, automated, end-to-end data services and pipelines. The team applies standardised data models and mappings to ensure data is accessible to end users through end-to-end user tools and APIs. They define and embed best practices, ensure compliance with Quality Management practices, and maintain alignment with automated data governance. They also acquire and process internal and external, structured and unstructured data in line with Product requirements.
A Senior NLP Data Engineer is a leading technical contributor who can consistently take a poorly defined business or technical problem, work it into a well-defined data problem / specification, and execute on it at a high level. They have a strong focus on metrics, both for the impact of their work and for its inner workings / operations. They are a model for the team on best practice for software development in general (and data engineering in particular), including code quality, documentation, DevOps practices, and testing, and they consistently mentor junior members of the team. They ensure the robustness of our services and serve as an escalation point in the operation of existing services, pipelines, and workflows.
Key Responsibilities:
- Designs, builds, and operates data tools, services, workflows, etc. that deliver high value as part of high-impact AI-driven products, leveraging modern data engineering tools (e.g., Spark, Kafka, Storm) and orchestration tools (e.g., Google Cloud Workflows, Airflow/Composer)
- Partners with the AI/ML and knowledge graph platform teams to build, test, and deploy NLP and GenAI pipelines, systems, and solutions
- Applies graph-based data modelling techniques for efficient data organization, integration, and retrieval, ensuring system flexibility and maintainability
- Produces well-engineered software, including appropriate automated test suites, technical documentation, and operational strategy
- Solves diverse problems and surfaces opportunities to reuse modular code and develop microservices that drive efficiencies
- Provides input into the roadmaps of upstream teams (e.g. Data Platforms, DataOps, DevOps) to help improve the overall program of work
- Applies platform abstractions consistently to maintain quality and consistency in logging and lineage
- Is fully versed in coding best practices and ways of working, participates in code reviews, and partners with others to improve the team's standards
- Adheres to the QMS framework and CI/CD best practices, and helps guide improvements to them that strengthen ways of working
- Provides leadership to team members to help others get the job done right
Why you?
Basic Qualifications:
We are looking for professionals with these required skills to achieve our goals:
- Bachelor's degree in Data Engineering, Computer Science, Software Engineering, or a related discipline
- 5+ years of data engineering experience in industry
- Knowledge of NLP and GenAI techniques, with experience processing unstructured data, using vector stores, and applying approximate retrieval
- Experience with building end-to-end systems based on machine learning or deep learning methods
- Experience overcoming high-volume, high-compute challenges
- Familiarity with orchestration tooling
- Cloud experience (e.g., AWS, Google Cloud, Azure)
- Experience with automated testing and test design
- Experience with DevOps-forward ways of working
- Deep knowledge of and experience using at least one common programming language (e.g., Python, Scala, Java)
- Deep experience with common big data tools (e.g., Spark, Kafka, Storm, …)
- Proven experience with machine learning algorithms and NLP frameworks such as PyTorch, TensorFlow, and spaCy
- Hands-on experience implementing CI/CD using Git and a common CI/CD stack (e.g., Jenkins, CircleCI, GitLab, Azure DevOps)
- Experience with agile software development environments using tools such as Jira and Confluence
- Experience with Infrastructure as Code and automation tools (e.g., Terraform)
Preferred Qualifications:
If you have the following characteristics, it would be a plus:
- Master's or PhD in Data Engineering, Computer Science, Software Engineering, or related discipline
- Good understanding of ontologies and semantic harmonization of data across sources
- Experience implementing Generative AI solutions is a huge plus
- Proven track record of working with knowledge graphs and graph databases, and a good general understanding of database concepts
- Proficiency in semantic web technologies (SPARQL, RDF, OWL) and harmonization of data
- Experience working with complex biomedical datasets, including genomics, proteomics, and high-throughput screening