
AI Solution and Platforms Junior Architect/Engineer

Posted: 2026/09/11
Other Business Support Services

Job description

Overview

The Junior AI Observability Architect is an execution-focused engineer who designs, builds, and operates observability capabilities within a defined domain of the enterprise AI observability platform. Working under the strategic direction of the Senior AI Observability Architect, this role translates architecture blueprints into production-grade instrumentation, telemetry pipelines, dashboards, quality gates, and safety signals across agentic AI systems. The junior architect is a hands-on engineer who codes, integrates, tests, and iterates, owning feature-level delivery within one or more specialization tracks while developing a growing understanding of the full observability platform. They are a technical practitioner first, with an emerging architect mindset.

Responsibilities

1. Observability Platform Engineering & OTEL Integration (25%)
- Implement OpenTelemetry (OTEL) instrumentation within assigned agent frameworks or platforms, including custom exporters, span enrichers, semantic conventions, and context propagation hooks.
- Build and maintain telemetry pipeline components (collectors, processors, exporters) that route metrics, logs, traces, and semantic signals to central observability backends.
- Integrate OTEL with enterprise agentic platforms as assigned (which may include Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks), following architecture blueprints set by the L11.
- Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for the assigned sub-domain, ensuring signal quality and low false-positive rates.
- Participate in on-call rotations and incident response for the observability platform, contributing to RCA documentation and runbook improvement.
- Write unit, integration, and end-to-end tests for all telemetry components; maintain >80% test coverage across owned services.
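The collector/processor/exporter pattern behind the telemetry pipeline components described above can be sketched in pure Python. This is an illustrative toy, not the platform's actual design: the class names, the span fields, and the 500 ms slow-call threshold are all hypothetical, and a real implementation would use the OTEL SDK's own interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """A simplified trace span; real OTEL spans carry far more context."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)

class Collector:
    """Buffers incoming spans from instrumented agents."""
    def __init__(self):
        self.buffer = []

    def receive(self, span: Span):
        self.buffer.append(span)

class Processor:
    """Enriches spans before export, e.g. tagging slow agent calls."""
    def __init__(self, slow_threshold_ms: float = 500.0):
        self.slow_threshold_ms = slow_threshold_ms

    def process(self, span: Span) -> Span:
        span.attributes["slow"] = span.duration_ms > self.slow_threshold_ms
        return span

class Exporter:
    """Ships processed spans to a backend; here it just collects them."""
    def __init__(self):
        self.exported = []

    def export(self, spans):
        self.exported.extend(spans)

def run_pipeline(collector: Collector, processor: Processor, exporter: Exporter):
    """Drain the collector through the processor into the exporter."""
    exporter.export(processor.process(s) for s in collector.buffer)
    collector.buffer.clear()

# Example: two agent tool-call spans flow through the pipeline;
# only the 900 ms call is tagged as slow.
collector, processor, exporter = Collector(), Processor(), Exporter()
collector.receive(Span("agent.tool_call", duration_ms=120.0))
collector.receive(Span("agent.tool_call", duration_ms=900.0))
run_pipeline(collector, processor, exporter)
```

The same three-stage shape is what the role would wire up with real OTEL span processors and exporters against central backends.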
2. Safety, Security & Red Teaming Support (15%)
- Instrument safety-critical signal capture within assigned pipelines, including guardrail trigger rates, policy violation events, prompt injection detections, and hallucination flags.
- Support red team exercises by building observability hooks that capture adversarial test results, attack surface telemetry, and behavioral deviation signals in real time.
- Implement secure trace handling for sensitive AI decision events, applying data masking, PII redaction, and audit-log retention policies as defined by the security architecture.
- Assist in maintaining the Security Observability Playbook: documenting findings, updating escalation paths, and contributing to incident classification procedures.
- Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) for anomalous communication patterns and flag deviations for review by the L11 architect and security team.

3. Responsible AI (RAI) & Governance Signal Instrumentation (10%)
- Implement RAI signal collectors within assigned agent workflows, capturing fairness indicators, bias detection outputs, explainability scores, and content safety classifications.
- Maintain RAI telemetry pipelines and ensure data quality, completeness, and timeliness of governance signals feeding into compliance dashboards.
- Contribute to audit-readiness work by ensuring all AI decision traces within the assigned domain include required governance metadata and are retained per policy.
- Support gap analyses by comparing current RAI signal coverage against governance framework requirements and flagging coverage gaps to the L11.

4. Quality Engineering for Agentic Solutions: Post Go-Live & Continuous QE (15%)
- Build and maintain quality gate components within CI/CD pipelines, using observability data to detect performance regressions, behavioral drift, and SLA breaches before they reach production.
- Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack: collecting eval results, tracking pass/fail trends, and alerting on regression thresholds.
- Implement continuous quality monitoring for post-go-live agentic solutions, tracking agent success rate, tool-call fidelity, latency distributions, and user outcome proxies.
- Conduct structured testing of new agent capabilities using standardized eval harnesses, documenting results and feeding findings into quality improvement cycles.
- Develop automated quality reports and quality metric dashboards for stakeholder review, surfacing trends and anomalies in agent behavior over time.

5. Memory, Skills, MCP & Harness Engineering Observability (10%)
- Instrument agent memory operations (read/write latency, cache hit rates, memory drift) across episodic, semantic, and working memory backends within the assigned scope.
- Add trace instrumentation to MCP server interactions, tagging tool registrations, skill invocations, context injections, and result returns with semantic OTEL attributes.
- Capture harness execution telemetry for self-evolving and RL systems, logging reward signals, policy update events, environment transitions, and convergence indicators.
- Monitor skill eval harness execution pipelines, detecting flaky evals, environment setup failures, and result inconsistencies that could mask real capability regressions.

6. Data Science & Python Engineering (10%)
- Write production-grade Python for observability tooling (custom OTEL exporters, signal aggregators, anomaly detectors, and data transformation pipelines), adhering to team engineering standards.
- Apply basic statistical and data science methods to telemetry data (time-series analysis, threshold tuning, distribution characterization) to improve signal quality and alerting precision.
- Contribute to Python SDK and library development that simplifies OTEL onboarding for agent developers across the organization.
- Participate in code reviews, apply test-driven development practices, and continuously improve the quality and maintainability of the observability codebase.

7. Agent Fleet, Physical AI & Multi-Modal Observability (5%)
- Implement telemetry for agent fleet coordination, capturing spawn/termination events, inter-agent communication traces, load distribution metrics, and fleet health indicators.
- Contribute to observability instrumentation for physical AI pipelines (edge inference, sensor fusion, robotics control loops) as directed, focusing on latency, reliability, and data quality signals.
- Add OTEL instrumentation to multi-modal model pipelines, tracing vision, audio, and text input processing stages and capturing cross-modal alignment quality signals.

8. Agentic Marketplace, Registry & A2A / UCP / AP2 Observability (5%)
- Instrument the Agentic Marketplace and Agent Registry with usage telemetry, tracking agent invocations, capability health scores, adoption trends, and dependency relationships.
- Implement protocol-level observability for A2A (Agent-to-Agent), UCP, and AP2 communication flows, capturing message latency, error rates, retry patterns, and trust boundary crossings.
- Contribute to Marketplace Observability Dashboard development, building data connectors, metric calculations, and visualization components.

9. Collaboration, Integration & Continuous Learning (5%)
- Collaborate closely with AI platform engineers, data scientists, SRE, and product teams to gather requirements, align on telemetry standards, and resolve integration friction.
- Participate in agile ceremonies (sprint planning, stand-ups, retrospectives), contributing to estimation, dependency identification, and delivery transparency.
- Stay current with emerging observability frameworks, OTEL specifications, agent communication protocols, and AI safety research, sharing learnings with the team regularly.
- Contribute to internal documentation, engineering wikis, and onboarding guides for the observability platform.

Qualifications
- Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
- 11+ years of experience in software engineering, platform engineering, or data engineering, with at least 2 years of hands-on work in observability, monitoring, or distributed systems.
- Demonstrated ability to deliver production-grade software in a team environment; track record of completing complex technical features end-to-end.
- Python Proficiency: Strong Python engineering skills, writing clean, testable, maintainable production code; familiarity with async patterns, type hints, and modern Python tooling (Poetry, Ruff, pytest).
- Observability Fundamentals: Solid working knowledge of the three pillars of observability (metrics, logs, traces); ability to instrument services with OpenTelemetry (OTEL) SDKs; understanding of trace context propagation and semantic conventions.
- Distributed Systems: Working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC APIs, and containerized deployment (Docker, Kubernetes).
- Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP), including managed services, IAM basics, and cost awareness.
- CI/CD & DevOps: Experience building or contributing to CI/CD pipelines; familiarity with GitOps, infrastructure-as-code concepts, and automated testing frameworks.
- Data Fundamentals: Ability to query, analyze, and visualize time-series and log data using tools such as Grafana, Datadog, Splunk, Prometheus, or equivalent.
- Hands-on experience with agentic AI frameworks (LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or equivalent).
- Contributions to open-source observability projects or the OTEL community.
- Familiarity with reinforcement learning concepts, self-supervised learning, or model fine-tuning workflows.
- Experience with security tooling relevant to AI (adversarial robustness libraries, LLM safety frameworks, or red-team toolkits).
- Exposure to Responsible AI frameworks, fairness evaluation libraries (Arize, Fairlearn, AI Fairness 360), or explainability tools (SHAP, LIME).
- Experience in a fast-paced AI platform, MLOps, or LLMOps role with production deployment responsibilities.
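The CI/CD quality gates described in the Quality Engineering track (failing a build when eval pass rates regress below baseline) reduce to a simple comparison. A minimal sketch, where the function names, the 0.05 regression budget, and the baseline value are illustrative assumptions rather than the platform's actual policy:

```python
from statistics import mean

def eval_pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return mean(1.0 if r else 0.0 for r in results)

def quality_gate(current: list[bool], baseline_rate: float,
                 max_regression: float = 0.05) -> bool:
    """Pass the gate only if the current pass rate stays within
    `max_regression` of the recorded baseline pass rate."""
    return eval_pass_rate(current) >= baseline_rate - max_regression

# Example: a skill eval suite with 18/20 passes against a 0.95 baseline
# sits exactly on the regression budget, so the gate still passes.
results = [True] * 18 + [False] * 2
gate_open = quality_gate(results, baseline_rate=0.95)
```

In practice the baseline would come from stored telemetry and the gate would run as a pipeline step, but the threshold comparison is the core of the mechanism.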

This job post has been translated by AI and may contain minor differences or errors.
