Job description
Work Schedule: Other
Environmental Conditions: Office
Summarized Purpose:
We are offering an opportunity for a Mid-Level Data Engineer to design, build, test, tune, and support production data pipelines using PySpark, Python, advanced SQL, AWS data services, secure data handling practices, and AI-assisted data engineering capabilities.
Education/Experience:
- Bachelor's degree or equivalent in Computer Science, Information Technology, Data Engineering, or related field
- 3-5 years of experience in data engineering, ETL development, SQL, AWS data platforms, or production data pipeline support
Major Job Responsibilities:
- Develop, test, tune, and maintain ETL and data pipelines using PySpark, Python, SQL, and AWS services
- Support ingestion and transformation of flat files, relational databases, APIs, data warehouses, and enterprise data sources
- Collaborate with business analysts, data architects, QA, DevOps, and senior engineers to implement source-to-target mappings and data solutions
- Implement CDC, incremental load design, idempotent pipeline processing, and data reconciliation patterns for reliable data movement
- Maintain technical documentation, mapping specifications, data catalog updates, runbooks, automated tests, and release support materials
Knowledge, Skills, and Abilities:
- Hands-on experience with PySpark, Python, advanced SQL, ETL best practices, data modeling, and large-scale data processing
- Deep knowledge of Redshift performance tuning, including distribution keys, sort keys, compression encoding, Spectrum, materialized views, WLM, and the VACUUM and ANALYZE commands
- Strong knowledge of Athena optimization including partition pruning, file formats, compression, schema evolution, and cost-efficient query design
- Strong understanding of DynamoDB data modeling, access-pattern-based design, capacity planning, GSIs/LSIs, TTL, Streams, and performance tuning
- Exposure to secure PHI/PII handling including encryption, access controls, auditability, retention, masking, and de-identification where applicable
- Strong analytical, troubleshooting, documentation, communication, and cross-functional collaboration skills
Must Have Skills:
- PySpark, Python, advanced SQL, ETL development, and data pipeline implementation experience
- Experience with AWS data services including S3, Glue, Lambda, Step Functions, ECS, DynamoDB, Redshift, and Athena, plus integration with PostgreSQL and SQL Server databases
- Flat-file ingestion, source-to-target mapping, transformation logic, CDC, incremental loads, idempotent processing, reconciliation, and data quality checks
- CI/CD, GitHub workflows, automated testing, and release management for data pipelines and database changes
- Problem-solving, production support, debugging, documentation, and Agile delivery skills
Good to Have Skills:
- Exposure to AI-assisted mapping automation and use of LLMs for data cleaning, data quality checks, transformation logic, or documentation
- Familiarity with RAG patterns, embeddings, vector databases, semantic search, or AI-enabled data discovery solutions
- Understanding of healthcare data standards such as HL7, FHIR, CCD, claims data, EMR extracts, clinical trial data, and patient de-identification
- Familiarity with infrastructure as code such as Terraform or CloudFormation, plus Databricks, Snowflake, streaming, observability, or DevOps practices
Working Hours:
- India: 05:30 PM to 02:30 AM IST
- Philippines: 08:00 PM to 05:00 AM PHT