Vishal Tyagi — Interview Prep Guide

120+ Questions Covered

12 Categories

94.6% Your RHCSA Score

∞ Confidence Points

👤 Introduction & Background

Tell me about yourself. / Walk me through your professional journey.

🔥 Always Asked

Your Answer

I'm Vishal Tyagi, a BCA graduate with a strong focus on Linux administration and DevOps. My journey started when I became fascinated by how Linux systems work under the hood: not just running commands, but understanding why services fail, how logs tell a story, and how infrastructure holds together. That passion pushed me to pursue the RHCSA certification, which I cleared with a score of 94.6%.

Since then, I've completed two internships that gave me real hands-on exposure. At Xloud, I worked on Docker-based deployments and container troubleshooting across AWS and OpenStack environments. I also dealt with client-side user-management failures and supported on-premises migration activities. At CYfuture, I worked as a Linux Administrator intern, where I was responsible for monitoring Linux servers and resolving L1 incidents: CPU spikes, memory pressure, disk issues, and failed services. I used tools like top, ps, journalctl, and systemctl daily. I also investigated Nginx logs during a security incident and helped with root cause analysis.

Beyond internships, I've built personal projects, including migrating an 8-tier microservices app from Docker Compose to Kubernetes on AWS EC2, complete with GitHub Actions CI/CD, SBOM generation, and Docker Scout scanning. I document everything on LinkedIn and Medium to reinforce my learning.

I'm a fresher, so I know I still have a lot to learn in production environments, but I bring strong Linux fundamentals, a bias for hands-on practice, and a genuine curiosity about infrastructure.

Tip: Speak this in 90 seconds. Don't rush. End with "I'm excited to bring this into a real team environment."

What are your day-to-day responsibilities in your previous project/internship?

Common

Your Answer

At CYfuture (Linux Admin Intern), a typical day involved:
• Morning: Checking server dashboards for overnight alerts such as CPU, memory, and disk utilization spikes.
• Incident response: When L1 alerts fired, I'd SSH in, run top/ps to check processes, use journalctl -xe to read service logs, and run systemctl status to see what had failed.
• A specific case: Nginx was returning 502s. I dug into /var/log/nginx/error.log and found that a backend service was crash-looping.
• Documentation: After every incident, I documented the steps and root cause so the team could refer back.

At Xloud (DevOps intern), I was more involved in:
• Deploying containerized apps on Docker and troubleshooting container issues.
• Fixing user-management misconfigurations in cloud environments.
• Supporting on-premises-to-cloud migration tasks.

What percentage of your work is coding/scripting vs manual work?

Common

Your Answer

In my Linux Admin role at CYfuture, it was roughly 70% manual operational work (monitoring, incident response, log analysis) and 30% scripting — Bash scripts for automating repetitive checks, health checks, and log parsing. In my personal projects, it skews more toward scripting and configuration — writing GitHub Actions workflows, Kubernetes manifests, Dockerfiles, and Python scripts for automation. I want to shift that ratio more toward automation as I grow — I believe good ops means automating yourself out of repetitive tasks.

Why did you move from development to DevOps? / Why DevOps?

Common

Your Answer

Honestly, I was never purely in development — my BCA gave me programming exposure, but what excited me more was understanding how software runs rather than just writing it. When I started digging into Linux internals during RHCSA prep — how processes communicate, how the kernel manages resources, how services depend on each other — I realized I wanted to be on the infrastructure and reliability side. DevOps sits at the intersection of that systems thinking and software automation, which is exactly where I want to be. I find more satisfaction in building a pipeline that deploys reliably than writing app logic.

What kind of workload have you run on AWS?

Common

Your Answer

Primarily through my personal projects. In my 8-tier microservices project:
• EC2: Used as Kubernetes worker nodes for the cluster.
• RDS (PostgreSQL): Connected as a stateful backend for the application.
• S3: Used for storing Docker build artifacts and logs.

I've also studied and practiced with Lambda and DynamoDB through AWS Cloud Quest, and I have a theoretical and lab-level understanding of Route53 for DNS routing. I want to be transparent: I haven't managed large production workloads independently on AWS, but I understand the architecture and have hands-on experience with the core services through my projects.

Honesty tip: Never claim more than you've done. Interviewers will probe with follow-up questions and catch you. Being honest about your level builds trust.

☁️ AWS Questions

Rate yourself on AWS from 1–5.

⚠️ Tricky

Your Answer

I'd rate myself a 2.5 to 3 out of 5. I have hands-on experience with EC2, RDS, S3, Route53, and GitHub Actions targeting AWS in my projects. I understand VPC, subnets, security groups, and IAM at a conceptual and lab level. I've also done AWS Cloud Quest. However, I haven't managed large-scale production AWS environments independently, so I'd be dishonest to rate myself higher. I can work with the core services confidently, and I learn quickly — I'm ready to be upskilled into anything you need.

Tip: Never say 4 or 5 as a fresher — they will ask expert-level questions and you'll be caught. 2.5–3 with honest explanation is respected.

How would you design an AWS architecture for a frontend + backend application?

🔥 Hot

Your Answer

Here's how I'd approach a typical 3-tier architecture:

Frontend:
• S3 for static hosting + CloudFront CDN for global distribution and HTTPS termination via ACM.
• Route53 for DNS pointing to CloudFront.

Backend:
• Application runs on EC2 (or ECS/EKS for containers) in a private subnet.
• ALB (Application Load Balancer) in a public subnet routes traffic to the backend.
• Auto Scaling Group registered with the ALB's target group for horizontal scaling.

Database:
• RDS in a private subnet, only accessible via the backend's security group.
• Multi-AZ enabled for high availability.

Security:
• Public subnets: only ALB and NAT Gateway.
• Private subnets: EC2 instances and RDS.
• Security groups tightly scoped: ALB allows 443 from the internet, EC2 only allows port 8080 from the ALB SG, RDS only allows 5432 from the EC2 SG.
• WAF attached to CloudFront or ALB.

Secrets:
• Database credentials stored in AWS Secrets Manager, fetched at runtime by the app.
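The security-group chaining above can be reasoned about as a small lookup. This is only a toy model in Python, not an AWS API call; the group names alb-sg, app-sg, and rds-sg are hypothetical labels for the three tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str  # another security group's name, or the literal "internet"
    port: int


# Hypothetical inbound rules mirroring the answer above.
INBOUND = {
    "alb-sg": [Rule("internet", 443)],
    "app-sg": [Rule("alb-sg", 8080)],
    "rds-sg": [Rule("app-sg", 5432)],
}


def is_allowed(source: str, target_sg: str, port: int) -> bool:
    """Return True if target_sg has an inbound rule for this source and port."""
    return any(
        rule.source == source and rule.port == port
        for rule in INBOUND.get(target_sg, [])
    )
```

For example, is_allowed("internet", "rds-sg", 5432) is False: the database is reachable only from the application tier, which is the point of referencing security groups instead of CIDR ranges.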

If an application runs in private subnet, how will it install packages from the internet?

🔥 Hot

Your Answer

Through a NAT Gateway placed in the public subnet. The flow is: Private EC2 → Route Table (default route 0.0.0.0/0 → NAT Gateway) → NAT Gateway → Internet Gateway → Internet. The NAT Gateway has an Elastic IP, and responses come back along the same path. This way, the EC2 in the private subnet can initiate outbound traffic (to install packages or pull container images), but nothing on the internet can initiate a connection to the EC2 directly.

Alternatively, for traffic to AWS services, you can use VPC Endpoints (e.g., for S3 or ECR). These avoid the internet entirely: traffic stays on the AWS network.
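To make the route-table step concrete, here is a minimal Python sketch of how the most specific route wins. The route table contents and the target labels ("local", "nat-gateway") are assumptions for a VPC using 10.0.0.0/16; real route tables use resource IDs.

```python
import ipaddress

# Hypothetical route table for a private subnet in a 10.0.0.0/16 VPC.
PRIVATE_ROUTE_TABLE = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-gateway"),
]


def next_hop(destination: str, routes=PRIVATE_ROUTE_TABLE) -> str:
    """Pick the most specific route (longest prefix) that contains the destination."""
    ip = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {destination}")
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target
```

Traffic to another instance in the VPC matches the /16 route and stays local, while a package mirror on the internet only matches the default route and is sent to the NAT Gateway.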

What is the difference between Elastic IP and Public IP?

Your Answer

Public IP:
• Automatically assigned when you launch an EC2 in a public subnet (if auto-assign is enabled).
• Released when you stop the instance; a new one is assigned on start, so the address changes.
• Cannot be kept after termination or moved to another instance.

Elastic IP:
• A static public IPv4 address that you explicitly allocate to your AWS account.
• It doesn't change; it survives stop/start cycles.
• You can re-associate it to another instance (useful for failover scenarios).

Cost: Since February 2024, AWS bills hourly for every public IPv4 address, both auto-assigned public IPs and Elastic IPs, whether or not they are attached to a running instance. An idle Elastic IP is pure cost, so release the ones you don't use.

Use case: If you have a DNS record pointing to your server's IP, you need an Elastic IP. Otherwise the IP changes every time you stop and start the instance and your DNS breaks.

What is the difference between EBS, EFS, and S3?

🔥 Hot

Your Answer

EBS (Elastic Block Store):
• Block storage, like a hard disk attached to a single EC2 instance.
• Low latency, high IOPS: great for OS, databases, application files.
• Tied to an Availability Zone. Can only attach to one EC2 at a time (except Multi-Attach on io1/io2).
• Can persist independently: data survives instance termination if "Delete on Termination" is off.

EFS (Elastic File System):
• Managed NFS (Network File System) that can be mounted by MULTIPLE EC2s simultaneously.
• Automatically scales, so there is no need to provision size.
• Accessible across multiple AZs in a region.
• Slower than EBS but great for shared workloads (web servers sharing content, shared config files).

S3 (Simple Storage Service):
• Object storage, not a filesystem. You store objects (files) with keys.
• Virtually unlimited storage, accessible over HTTP/HTTPS.
• Not mountable as a filesystem natively (you can use s3fs but it's not ideal).
• Great for backups, static assets, data archival, logs.

Simple way to remember: EBS = your laptop's SSD, EFS = a shared network drive, S3 = Google Drive for apps.

Can EBS be attached to multiple instances? What happens to EBS/EFS if EC2 is terminated?

⚠️ Tricky

Your Answer

EBS and multiple instances: By default, NO. An EBS volume can only be attached to one instance at a time. Exception: io1/io2 volumes support Multi-Attach, which allows up to 16 Nitro-based instances in the same AZ. This requires careful application-level handling (for example a cluster-aware filesystem) to avoid data corruption.

What happens on EC2 termination:
• EBS root volume: By default, deleted on termination (Delete on Termination = true). If you set it to false, the volume persists as an unattached volume, and you keep paying for it.
• Additional EBS volumes: By default, they persist (Delete on Termination = false). You can attach them to another instance.

EFS on EC2 termination:
• EFS is completely independent of EC2. Terminating an EC2 has zero effect on EFS. The file system and all data remain. It's a managed service that persists until you explicitly delete it.
• Other instances can still mount and access EFS normally.

How do you connect EC2 with RDS securely?

Your Answer

Several layers:
1. Network isolation: Place RDS in a private subnet with public access disabled, so it has no public IP.
2. Security Groups: Create a dedicated SG for RDS. Allow inbound on port 5432 (PostgreSQL) or 3306 (MySQL) ONLY from the EC2's security group ID, as an SG reference rather than a CIDR range. This way, only instances with that SG can connect.
3. IAM Database Authentication: Instead of a static password, the app requests a short-lived auth token (valid for 15 minutes). The EC2's IAM role is granted the rds-db:connect permission for the database user.
4. Credentials in Secrets Manager: If using traditional credentials, store them in AWS Secrets Manager. The app fetches the secret at runtime, so credentials are never hard-coded in code or baked into environment variables.
5. Encryption in transit: Enable SSL/TLS on the RDS connection. AWS provides the RDS CA certificates.
6. Encryption at rest: Enable AES-256 encryption of the RDS storage using KMS.

What is WAF and where would you place it?

Your Answer

WAF = Web Application Firewall. It protects web applications from common attacks like SQL injection, XSS (cross-site scripting), bot traffic, and HTTP floods. AWS WAF works at Layer 7 (application layer).

You attach it to:
• CloudFront (best for global protection; inspects traffic at the edge before it reaches your region)
• Application Load Balancer (ALB)
• API Gateway

You define web ACLs with rules: either AWS Managed Rules (pre-built rule groups covering common OWASP Top 10 risks) or custom rules based on IP, headers, request body patterns, rate limits, etc.

I'd place WAF at CloudFront for static/frontend traffic and at ALB for backend API traffic. This gives defense-in-depth.

What is ECS vs EKS?

Your Answer

ECS (Elastic Container Service):
• AWS's native container orchestration service.
• Simpler to set up and manage.
• AWS-specific, so less portable.
• Two launch types: EC2 (you manage nodes) and Fargate (serverless containers, where AWS manages the underlying infra).
• Great for teams already deep in the AWS ecosystem who don't need Kubernetes complexity.

EKS (Elastic Kubernetes Service):
• Managed Kubernetes on AWS. AWS manages the control plane.
• More complex to set up but gives you the full Kubernetes ecosystem.
• Portable: your Kubernetes knowledge transfers to GKE (Google) and AKS (Azure).
• Better for larger, more complex workloads with multi-cloud requirements.

My experience: I've worked with Kubernetes (EKS-style) in my 8-tier microservices project on EC2 nodes. ECS I understand conceptually through studies.

When to use which: ECS for simpler AWS-native deployments. EKS for teams that need Kubernetes features such as custom controllers, Helm, complex scheduling, or portability.

How do you debug a failing Lambda function?

Your Answer

Step 1, CloudWatch Logs: Every Lambda invocation writes to CloudWatch Logs. I'd go to the /aws/lambda/{function-name} log group and look at the log stream for the failing invocation. Check for error messages, stack traces, timeouts.

Step 2, check the error type:
• Timeout: Function ran longer than the configured timeout. Increase the timeout or optimize the code.
• Out of Memory: Increase the memory allocation.
• Permissions error: Lambda execution role is missing IAM permissions. Check CloudTrail for AccessDenied events.
• Runtime error: Unhandled exception in code. Fix the code logic.

Step 3, Lambda Test: Use the built-in test functionality in the AWS Console with sample event payloads to reproduce the error.

Step 4, X-Ray tracing: Enable AWS X-Ray for distributed tracing. This is useful if Lambda is calling other services (RDS, DynamoDB, API) to see where latency or failures occur.

Step 5, check the Dead Letter Queue (DLQ): If async invocations are failing, configure a DLQ (SQS or SNS) to capture failed events for analysis.
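Step 1 is much easier when the function logs in a way that is easy to search. The following is a minimal sketch of a Python handler, assuming the standard Lambda context object (aws_request_id and get_remaining_time_in_millis()); the process function and the order_id field are hypothetical placeholders for real business logic.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def process(event):
    """Hypothetical business logic; raises KeyError on a malformed event."""
    return {"order_id": event["order_id"], "status": "processed"}


def handler(event, context):
    # One structured line per invocation makes the failing request easy to find.
    logger.info(json.dumps({
        "request_id": context.aws_request_id,
        "remaining_ms": context.get_remaining_time_in_millis(),
        "event_keys": sorted(event),
    }))
    try:
        return process(event)
    except Exception:
        # logger.exception records the stack trace; re-raising keeps the
        # invocation marked as failed so retries and the DLQ still apply.
        logger.exception("unhandled error for request %s", context.aws_request_id)
        raise
```

Logging the remaining time at the start helps distinguish a genuine timeout from an exception, and re-raising rather than swallowing the error keeps the failure visible in metrics.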

What is hub-and-spoke architecture in AWS? Can two spokes communicate with each other?

⚠️ Tricky

Your Answer

Hub-and-Spoke in AWS is a network topology typically implemented using AWS Transit Gateway (TGW).

Hub: Central Transit Gateway, which acts as a regional network hub.
Spokes: Multiple VPCs (dev, staging, prod, shared-services) attached to the TGW via TGW attachments. On-premises networks join the same hub through VPN or Direct Connect attachments.

Benefits: Instead of a complex VPC peering mesh (N*(N-1)/2 connections for N VPCs), you have one central router with centralized routing, security, and inspection.

Can two spokes communicate directly?
• By default in Transit Gateway, YES. All attached VPCs can route to each other through the TGW, provided the TGW route tables and the VPC route tables allow it.
• But you can isolate them: Create separate TGW route tables and associate spokes with different ones. A spoke associated only with an "isolated" route table can communicate with the hub (shared services) but NOT with other isolated spokes.

Use case: Production VPC should not communicate with Dev VPC, but both need access to a Shared Services VPC (DNS, monitoring). You'd put Prod and Dev in isolated route tables, and Shared Services in a centralized table with propagation to both.
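The scaling argument and the isolation rule can both be checked with a few lines of Python. This is a toy model, not Transit Gateway configuration; the ROUTE_TABLES contents are an assumed layout matching the Prod/Dev/Shared Services use case.

```python
def peering_connections(vpcs: int) -> int:
    """Full-mesh VPC peering needs one connection per pair of VPCs."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """With a Transit Gateway, each VPC needs a single attachment to the hub."""
    return vpcs


# Hypothetical TGW route tables: for each spoke, the attachments it has routes to.
ROUTE_TABLES = {
    "prod": {"shared"},
    "dev": {"shared"},
    "shared": {"prod", "dev"},
}


def can_reach(src: str, dst: str) -> bool:
    """Traffic only works if both directions have a route (the reply must return)."""
    return dst in ROUTE_TABLES[src] and src in ROUTE_TABLES[dst]
```

With 10 VPCs, a full mesh needs peering_connections(10) = 45 peerings versus 10 attachments, and can_reach("prod", "dev") is False while both can still reach "shared".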

How do you manage SSL certificates? Where else can you generate SSL certificates apart from ACM?

Your Answer

In AWS:
• ACM (AWS Certificate Manager): Public certificates are free for use with integrated services such as CloudFront, ALB, and API Gateway, and they renew automatically. The free certificates cannot be exported, so you can't install them on an EC2 instance directly.

For EC2 / non-AWS services, you'd use:
• Let's Encrypt with Certbot: Free, automated, 90-day certificates. Very common for self-managed servers. Certbot handles DNS or HTTP challenge validation and auto-renewal via cron or a systemd timer.
• Self-signed certificates: For internal services and testing. Not trusted by browsers but useful for service-to-service encryption.
• HashiCorp Vault: Can act as a private CA and issue short-lived certificates. Great for internal mTLS between microservices.
• OpenSSL: Generate your own CA and sign certificates manually. Used for internal PKI.

In Kubernetes, I've worked with cert-manager. It integrates with Let's Encrypt and manages certificate issuance and renewal as Kubernetes resources (Certificate, Issuer, ClusterIssuer).

⚙️ Kubernetes Questions

Rate yourself on Kubernetes. Core components of Kubernetes.

🔥 Hot

Your Answer

Rating: 2.5 to 3 out of 5. I've deployed an 8-tier microservices app on Kubernetes on AWS EC2, including StatefulSets, Deployments, Services, PVCs, network policies, and GitHub Actions CI/CD. I understand the concepts well but haven't managed large-scale production Kubernetes clusters.

Core Components:

Control Plane (Master):
• API Server: The front door of Kubernetes. All kubectl commands hit the API server. Validates and processes requests.
• etcd: Distributed key-value store. Holds ALL cluster state: every object, every config.
• Scheduler: Watches for unscheduled Pods and assigns them to nodes based on resource requests, taints, affinity.
• Controller Manager: Runs controllers that watch the actual state and reconcile it to the desired state (Deployment controller, ReplicaSet controller, etc.).

Worker Nodes:
• kubelet: Agent on each node. Receives pod specs from the API server and ensures containers are running.
• kube-proxy: Handles network rules on each node and manages Service routing using iptables/IPVS.
• Container Runtime: containerd or CRI-O actually runs the containers. Docker Engine needs the cri-dockerd adapter since Kubernetes 1.24 removed dockershim.

Add-ons:
• CoreDNS: Cluster DNS. Resolves service names like my-service.default.svc.cluster.local
• Ingress Controller: Routes external HTTP/HTTPS traffic to services.

Difference between Deployment and StatefulSet.

🔥 Hot

Your Answer

Deployment:
• For stateless applications (web servers, APIs, microservices).
• Pods are interchangeable; they don't have unique identities.
• Pod names are random: app-7b9c4d-xyz.
• Can scale up/down in any order.
• Rolling updates replace pods in any order.
• No guaranteed persistent storage per pod.

StatefulSet:
• For stateful applications (databases, Kafka, ZooKeeper, Elasticsearch).
• Pods have stable, unique identities: app-0, app-1, app-2.
• Pods start in order (0 first, then 1, then 2) and are terminated in reverse order.
• Each pod gets its own PVC (Persistent Volume Claim) via volumeClaimTemplates.
• If a pod is deleted, it comes back with the SAME name and re-attaches to the SAME PVC.
• DNS: With a headless Service, each pod gets a stable DNS name: app-0.my-service.default.svc.cluster.local.

In my project: I used StatefulSet for Kafka and Redis inside Kubernetes, with Deployments for the stateless application tiers.

What happens to PVC if a StatefulSet pod is deleted?

⚠️ Tricky

Your Answer

This is an important behavior to understand.

When a StatefulSet pod is deleted (not the StatefulSet itself):
• Kubernetes recreates the pod with the SAME name (e.g., kafka-0).
• The PVC is NOT deleted; it persists.
• The new pod re-binds to the same PVC automatically.
• This is the whole point of StatefulSet: persistent identity and storage.

When the StatefulSet itself is deleted:
• The pods are deleted.
• BUT the PVCs are NOT automatically deleted. This is the default by design (to prevent accidental data loss).
• You must manually delete the PVCs if you want to clean up storage.

This is different from Deployments, where pods don't have PVC binding by default and volumes are typically ephemeral (emptyDir) or shared.

Gotcha: If you delete a StatefulSet and recreate it, the new pods will find existing PVCs with the same names and rebind to them, picking up old data. This can be useful or problematic depending on the situation.
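The rebinding gotcha follows from the fact that pod and PVC names are deterministic. A small Python sketch of the naming scheme, assuming the default cluster domain cluster.local; the StatefulSet name kafka, Service kafka-headless, and claim template data are hypothetical.

```python
def statefulset_identities(name, service, namespace, claim_template, replicas):
    """Return (pod name, stable DNS name, PVC name) for each ordinal."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        dns = f"{pod}.{service}.{namespace}.svc.cluster.local"
        # PVC names follow <volumeClaimTemplate name>-<pod name>.
        pvc = f"{claim_template}-{pod}"
        identities.append((pod, dns, pvc))
    return identities
```

Because a recreated StatefulSet with the same name and claim template computes the same PVC names (data-kafka-0, data-kafka-1, and so on), its pods attach to whatever claims already exist.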

What causes CrashLoopBackOff? How do you debug it?

🔥 Hot

Your Answer

CrashLoopBackOff means the container starts, crashes, Kubernetes restarts it, and it crashes again, with the wait time between restarts growing each time (exponential backoff).

Common Causes:
1. Application error: Bug in the app, missing dependency, wrong entrypoint command.
2. Missing config/env: Required environment variable or config missing.
3. Missing secret: The app can't read a credential it needs at startup. (If the Secret object referenced in the pod spec doesn't exist at all, the pod typically shows CreateContainerConfigError instead.)
4. Wrong image: The image pulls but is the wrong build, tag, or architecture, so the entrypoint fails. (If the image can't be pulled at all, the status is ErrImagePull/ImagePullBackOff, not CrashLoopBackOff.)
5. OOMKilled: Container exceeds its memory limit and gets killed.
6. Liveness probe failure: App takes too long to start, and the liveness probe kills it before it's ready.
7. Port conflict or permission issue inside the container.

Debugging Steps:
1. kubectl describe pod <pod-name>: Look at the Events section and Last State. They show why it's failing (exit code, OOMKilled, probe failures, etc.).
2. kubectl logs <pod-name>: Current logs. If the container has already crashed: kubectl logs <pod-name> --previous (logs from the last crashed instance).
3. kubectl get pod <pod-name> -o yaml: Check resource limits, env vars, volume mounts, image name.
4. If you can exec in: kubectl exec -it <pod-name> -- /bin/sh to manually check the file system and run the app command.
5. Temporarily override the command in the manifest to something like sleep 3600 to keep the container alive and debug inside.

Will updating the manifest fix it automatically? Yes, if the pod is managed by a controller such as a Deployment and the change addresses the root cause. Kubernetes detects the change, rolls out new pods with the updated spec, and terminates the old crash-looping pods.
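The "growing wait time" can be made concrete. A minimal sketch, assuming the kubelet's documented defaults of a 10-second initial delay that doubles per restart and is capped at 300 seconds:

```python
def restart_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each restart: doubles from `base`, capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]
```

restart_delays(7) gives [10, 20, 40, 80, 160, 300, 300], which is why a pod that has been crash-looping for a while appears to do nothing for up to five minutes between attempts.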

What is OOMKilled? Difference between CPU throttling vs memory OOM.

⚠️ Tricky

Your Answer

OOMKilled (Out Of Memory Killed): When a container's memory usage exceeds its memory LIMIT, the Linux kernel's OOM killer terminates the process. In Kubernetes, the container's last state shows reason OOMKilled (exit code 137). The container is killed immediately, with no graceful shutdown.

CPU Throttling:
• CPU is a compressible resource. If a container hits its CPU limit, the kernel throttles it (slows it down). The container is NOT killed.
• It just runs slower. You'll see a high CPU throttle percentage in metrics.

Memory OOM:
• Memory is NOT compressible. Once a container exceeds its memory limit, it gets killed immediately.
• This is why memory limits are more dangerous to set too low.

Key insight: Set CPU requests/limits carefully, because limits that are too tight throttle performance. For memory, always set limits to stop one container from consuming all node memory, but make sure the limits are higher than peak usage.

Troubleshooting: kubectl describe pod <pod-name> → look for OOMKilled in Last State. Fix: increase the memory limit, or find the memory leak in the application.

How do you delete a namespace stuck in Terminating state? How do you remove finalizers?

⚠️ Tricky

Your Answer

A namespace gets stuck in Terminating when resources inside it have finalizers. Kubernetes won't delete the namespace until all finalizers are cleared.

Step 1: Check what's holding it: kubectl get all -n <name>. Note that "get all" skips custom resources; to list everything, use kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <name>

Step 2: Get the namespace JSON and remove the finalizers: kubectl get namespace <name> -o json > ns.json. Edit ns.json and remove the "kubernetes" entry from the spec.finalizers array.

Step 3: Apply via the API server directly (bypassing the normal deletion flow): kubectl replace --raw "/api/v1/namespaces/<name>/finalize" -f ns.json

Or use kubectl proxy: kubectl proxy & curl -H "Content-Type: application/json" -X PUT --data-binary @ns.json http://127.0.0.1:8001/api/v1/namespaces/<name>/finalize

Why this works: Finalizers are metadata on objects that tell Kubernetes "don't delete me until I say so". They are usually used by controllers to clean up external resources. If the controller is gone but the finalizer remains, the object is stuck. Clearing finalizers manually tells Kubernetes "okay, delete now." Treat this as a last resort, because anything the finalizer was meant to clean up may be left behind.
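The manual edit in Step 2 is easy to get wrong by hand, so it can be scripted. A minimal Python sketch, assuming ns.json is the file written by kubectl get namespace <name> -o json; the function names are hypothetical.

```python
import copy
import json


def strip_namespace_finalizers(namespace: dict) -> dict:
    """Return a copy of the Namespace object with spec.finalizers emptied."""
    cleaned = copy.deepcopy(namespace)
    cleaned.setdefault("spec", {})["finalizers"] = []
    return cleaned


def rewrite_file(path: str = "ns.json") -> None:
    """Rewrite the exported namespace file in place, ready for the /finalize call."""
    with open(path) as handle:
        namespace = json.load(handle)
    with open(path, "w") as handle:
        json.dump(strip_namespace_finalizers(namespace), handle, indent=2)
```

After rewrite_file() has run, the file can be passed to the kubectl replace --raw command from Step 3 unchanged.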

How do pods communicate? How do you secure pod communication?

Your Answer

Pod Communication:
• Every pod gets a unique IP address (from the Pod CIDR).
• Pods can communicate with any other pod by IP directly (flat network model, with no NAT between pods by default).
• For service discovery, pods use Kubernetes Services. CoreDNS resolves service names → ClusterIP → kube-proxy routes to a pod.
• Pod → Service → Pod is the typical pattern. Services provide stable IPs and load balancing.

Securing Pod Communication:
1. Network Policies: Kubernetes-native firewall rules at Layer 3/4. Define which pods can communicate with which pods using label selectors. Example: Only allow frontend pods to reach backend pods on port 8080 and deny all other traffic. They are only enforced if the cluster's CNI plugin supports them (for example Calico or Cilium).
2. mTLS (Mutual TLS): Both sides authenticate with certificates. Implemented via service meshes like Istio or Linkerd. Traffic between pods is encrypted and authenticated.
3. Namespace isolation: Put sensitive workloads in separate namespaces with strict NetworkPolicies.

In my project: I implemented Kubernetes Network Policies to restrict inter-service communication to only what was needed, as part of the container security layer I built.

Command to exec into a container. What if pod has multiple containers?

Your Answer

Single container pod:
kubectl exec -it <pod-name> -- /bin/bash
(or /bin/sh if bash isn't available)

Specific container in a multi-container pod:
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash

Find container names:
kubectl describe pod <pod-name> | grep -A5 "Containers:"
or
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'

What is HPA vs VPA? How do you optimize EKS node utilization?

Your Answer

HPA (Horizontal Pod Autoscaler):
• Scales the NUMBER of pod replicas based on metrics (CPU, memory, custom metrics).
• If average CPU is above the 70% target, add more pods. If CPU drops, remove pods.
• Works with Deployments and StatefulSets.
• kubectl autoscale deployment app --cpu-percent=70 --min=2 --max=10

VPA (Vertical Pod Autoscaler):
• Adjusts the CPU and memory REQUESTS/LIMITS of existing pods based on actual usage.
• Good when you don't know the right resource sizes: VPA learns and recommends.
• VPA can automatically update pod specs (requires pod restart) or just recommend.
• Don't let HPA and VPA act on the same metric (CPU or memory) for the same workload, because they will fight each other.

EKS Node Utilization Optimization:
• Cluster Autoscaler or Karpenter: Automatically add/remove nodes based on pending pods.
• Karpenter (newer): Faster scaling, and it picks an instance type that fits the pending pods.
• Right-sizing: Use VPA recommendations to set accurate resource requests, which prevents wasted capacity.
• Spot instances: Use Spot node groups for non-critical workloads (in my project I reduced cloud costs by ~20% using this strategy).
• Pod Disruption Budgets: Ensure smooth node drain without service disruption.
• Bin packing: The Kubernetes scheduler already does this, but proper resource requests/limits help it pack pods efficiently.
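The HPA decision is a simple ratio. A minimal Python sketch of the documented rule, desired = ceil(current replicas × current metric / target metric), clamped to the min/max bounds; the 0.1 tolerance is the default in upstream Kubernetes, and the function name is hypothetical.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Compute the replica count an HPA would ask for, given one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to the target: no scaling
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(wanted, max_replicas))
```

With 4 replicas averaging 90% CPU against a 70% target, the ratio is about 1.29, so the HPA asks for ceil(4 × 1.29) = 6 replicas. At 72% the ratio falls inside the tolerance band and nothing changes.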

Host-based routing vs Path-based routing in Kubernetes Ingress.

Your Answer

Ingress routes external HTTP/HTTPS traffic to services inside the cluster.

Host-based routing: Routes based on the hostname/domain. Different subdomains go to different services.
• api.example.com → backend-service
• app.example.com → frontend-service

Path-based routing: Routes based on the URL path. Same domain, different paths.
• example.com/api → backend-service
• example.com/ → frontend-service
• example.com/admin → admin-service

You can combine both:
• api.example.com/v1 → v1-service
• api.example.com/v2 → v2-service

In practice: Path-based is common when multiple microservices share one domain. Host-based is cleaner when each service has its own subdomain.

Can you expose a database using Ingress? No. Ingress is HTTP/HTTPS only (Layer 7), and databases use raw TCP. For external TCP access, you'd use a Service of type LoadBalancer directly, or an ingress controller that supports TCP/UDP passthrough. But exposing a DB externally is generally bad practice; keep it internal.
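How a controller chooses a backend can be sketched in a few lines of Python. This is a simplified model of the Prefix path type (exact host match, then the longest matching path prefix on whole segments), not any controller's actual implementation; the RULES list reuses the example hosts and services above.

```python
# Hypothetical rules: (host, path prefix, backend service).
RULES = [
    ("api.example.com", "/v1", "v1-service"),
    ("api.example.com", "/v2", "v2-service"),
    ("example.com", "/api", "backend-service"),
    ("example.com", "/admin", "admin-service"),
    ("example.com", "/", "frontend-service"),
]


def prefix_matches(prefix: str, path: str) -> bool:
    """Match on whole path segments, so /api matches /api/users but not /apix."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def route(host: str, path: str, rules=RULES):
    """Filter by host first, then pick the longest matching path prefix."""
    candidates = [
        (prefix, service)
        for rule_host, prefix, service in rules
        if rule_host == host and prefix_matches(prefix, path)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: len(candidate[0]))[1]
```

A request for example.com/api/users goes to backend-service, while example.com/apix falls through to the catch-all frontend-service because /api only matches on a segment boundary.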

🏗️ Terraform Questions

Where do you store Terraform state file? What if it's corrupted?

🔥 Hot

Your Answer

Locally: By default, terraform.tfstate is stored locally. Never acceptable for team environments.

Remote backends (production):
• S3 + DynamoDB (most common for AWS): State file in an S3 bucket, with a DynamoDB table for state locking (prevents concurrent applies from corrupting state). Terraform 1.10+ can also lock natively in S3 via use_lockfile.
• Terraform Cloud / HCP Terraform: Managed state with locking, history, and collaboration.
• Azure Blob Storage, GCS for the respective clouds.

Configuration:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infra.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

If the state file is corrupted:
1. Don't panic. If you use S3 with versioning enabled (you should always enable versioning), restore a previous version.
2. terraform state list to see what Terraform knows.
3. terraform state rm <resource> to remove corrupt entries.
4. terraform import to re-import the actual resource.
5. As a last resort: manually edit the JSON state file (backup first!) to remove the corrupt resource block.

Prevention: Always use remote state with versioning and locking.
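Before restoring or hand-editing a state file, it helps to confirm that it still parses and to see which resources it tracks. A minimal Python sketch, assuming the version 4 state layout (a top-level "resources" list whose entries carry "mode", "type", "name", and an optional "module"); the function name is hypothetical and this is not a replacement for terraform state list.

```python
import json


def list_state_resources(state_text: str) -> list:
    """Parse a Terraform state document and return addresses like aws_instance.web."""
    state = json.loads(state_text)  # raises json.JSONDecodeError if the file is corrupt
    addresses = []
    for resource in state.get("resources", []):
        address = f'{resource["type"]}.{resource["name"]}'
        if resource.get("mode") == "data":
            address = "data." + address
        if "module" in resource:
            address = f'{resource["module"]}.{address}'
        addresses.append(address)
    return addresses
```

Running this against each S3 object version is a quick way to find the newest copy that is still valid JSON and contains the resources you expect.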

Terraform lifecycle stages. What are lifecycle policies?

🔥 Hot

Your Answer

Terraform workflow stages:
1. terraform init: Initialize the backend, download providers and modules.
2. terraform plan: Compare desired state (your .tf files) with current state and generate an execution plan.
3. terraform apply: Execute the plan. Create/update/delete resources.
4. terraform destroy: Destroy all managed infrastructure.

lifecycle block (controls resource behavior):

resource "aws_instance" "web" {
  # ...
  lifecycle {
    create_before_destroy = true  # Create new before destroying old (zero-downtime)
    prevent_destroy       = true  # Block 'destroy' — protects critical resources
    ignore_changes        = [tags, ami]  # Ignore external changes to these attrs
    replace_triggered_by  = [aws_security_group.web.id]  # Replace if this changes
  }
}

• create_before_destroy: Critical for resources that must stay available, where the replacement needs to be up before the old one is gone.
• prevent_destroy: Protect prod databases from accidental terraform destroy.
• ignore_changes: Useful when external automation modifies certain fields; prevents Terraform from reverting them.
• replace_triggered_by: Force resource replacement when a dependency changes.

What is terraform taint/untaint? What is terraform target?

Your Answer

terraform taint (deprecated in newer versions, replaced by -replace): Marks a resource as tainted, which forces it to be destroyed and recreated on the next apply.
New way: terraform apply -replace="aws_instance.web"
Use case: An EC2 instance is in a bad state and you want Terraform to recreate it without changing any config.

terraform untaint: Removes the taint mark so the resource will not be recreated. It is still relevant when Terraform itself marks a resource as tainted (for example after a failed provisioner) and you have verified the resource is actually fine. With the -replace workflow there is no persistent mark to remove: just run terraform apply without the -replace flag.

-target: A flag (not a separate command) that limits plan/apply to a specific resource and its dependencies, ignoring everything else.
terraform plan -target="aws_instance.web"
terraform apply -target="aws_s3_bucket.logs"

Use cases:
• Apply a single resource in a large infrastructure.
• Debug a specific resource.
• Unblock a dependency issue.

Warning: Use -target carefully in production. It applies only part of the configuration, so other pending changes stay unapplied and the state can drift from what the .tf files describe. Not recommended for regular use; it's a surgical tool for emergencies.

What happens if a resource is created manually in AWS but not in Terraform? How do you import it?

⚠️ Tricky

Your Answer

Scenario: Someone created an S3 bucket manually in the console. Terraform doesn't know about it.

What happens on terraform apply?
• If your .tf files don't reference this resource: Terraform ignores it. It won't touch it.
• If your .tf files DO define a resource with the same name/ID: Terraform tries to create a new one and fails with a conflict or "already exists" error.

To bring it under Terraform management:
Step 1: Write the resource definition in your .tf file matching the real resource.
Step 2: Import it: terraform import aws_s3_bucket.my_bucket my-existing-bucket-name. This tells Terraform: "that real bucket IS this resource in my state."
Step 3: Run terraform plan. It should show no changes if your .tf config matches reality.
Step 4: Reconcile any differences in your .tf file.

Will Terraform delete a manually created resource?
• NO. If Terraform doesn't know about it (not in state and not in .tf files), it won't touch it.
• Solution for adopting it: import first, then apply.

New in Terraform 1.5+: import blocks in config files for declarative imports, and terraform plan -generate-config-out to draft the matching resource definition.

Why use Terraform instead of CloudFormation? What is a Terraform module?

Your Answer

Terraform vs CloudFormation: • Multi-cloud: Terraform works with AWS, GCP, Azure, Kubernetes, Datadog, etc. CloudFormation is AWS-only. • Language: Terraform uses HCL (cleaner, more readable). CloudFormation uses JSON/YAML (verbose, error-prone). • Community: Terraform has a huge provider ecosystem (1000+ providers). Massive module registry. • State management: Terraform has explicit state file (more control). CloudFormation uses its own internal state (less transparent). • Plan visibility: terraform plan shows exactly what will change before applying. CloudFormation change sets are less intuitive. Terraform Modules: A module is a reusable, self-contained package of Terraform configuration. Think of it as a function/class for infrastructure.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
  name    = "my-vpc"
  cidr    = "10.0.0.0/16"
  azs     = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
}

You can write your own modules or use community modules from registry.terraform.io. Good modules encapsulate complexity — you call the module with inputs, it creates all the resources.

🔄CI/CD Questions

Explain your CI/CD pipeline. What tools have you used?

🔥 Hot

Your Answer

In my 8-tier microservices project, I built a CI/CD pipeline using GitHub Actions: Pipeline stages: 1. Trigger: On push to main branch or PR. 2. Lint & Test: Run unit tests, lint Dockerfiles. 3. Security scanning: • Docker Scout for vulnerability scanning of images. • SBOM (Software Bill of Materials) generation using Syft. 4. Docker Build: Multi-architecture build (amd64 + arm64) using Docker Buildx and QEMU. • Built to support both AWS Graviton (arm64) and standard EC2 for cost optimization. 5. Push to Registry: Push images to Docker Hub (or ECR). 6. Deploy: Apply Kubernetes manifests to the cluster using kubectl. The multi-arch build specifically helped optimize compute costs — Graviton instances are ~20% cheaper on AWS, and having arm64 images allowed us to use them. Tools I've used: GitHub Actions (primary), Docker, kubectl, Docker Buildx. Aware of: Jenkins (setup on nodes/agents), GitLab CI/CD (pipeline yaml), AWS CodePipeline/CodeBuild/CodeDeploy (studied). CI vs CD: CI (Continuous Integration): Automating the build, test, and scan on every code push. Ensures code integrates cleanly. CD (Continuous Delivery/Deployment): Automating the delivery of tested code to staging/production environments.

Difference between blue-green deployment and canary deployment. How do you rollback?

🔥 Hot

Your Answer

Blue-Green Deployment: • Maintain two identical environments: Blue (current live) and Green (new version). • Deploy new version to Green. Test it. • Switch ALL traffic from Blue to Green at once (via load balancer update or DNS). • If issues arise, switch back to Blue instantly. • Pros: Zero downtime, instant rollback. Cons: Requires double infrastructure (costly). Canary Deployment: • Deploy new version to a SMALL subset of servers/pods (say 5% of traffic). • Monitor metrics — error rates, latency, business KPIs. • If healthy, gradually increase traffic (10%, 25%, 50%, 100%). • If issues found, route all traffic back to old version. • Pros: Low risk, gradual validation with real users. Cons: Complexity, need good monitoring. In Kubernetes: • Blue-Green: Two Deployments, switch Service selector label. • Canary: Use Ingress weight (nginx ingress supports canary annotations) or a service mesh like Istio for traffic splitting. • ArgoCD + Argo Rollouts: Makes both patterns automated with automatic analysis. Rollback: • Kubernetes: kubectl rollout undo deployment/my-app • kubectl rollout history deployment/my-app (see history) • kubectl rollout undo deployment/my-app --to-revision=3 (specific version) • Docker: Tag images with git SHA, rollback means deploying the previous image tag. • GitOps (ArgoCD): Revert the Git commit — ArgoCD syncs the old state automatically.
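The canary promotion loop described above can be sketched in a few lines. This is only an illustration of the decision logic: run_canary, fake_check, and the step list are hypothetical names, and a real rollout would update an ingress weight or mesh route and read actual metrics at each step.

```python
from typing import Callable, Sequence


def run_canary(steps: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk through traffic percentages and return the final canary weight.

    Returns the last percentage reached if every step was healthy,
    or 0 if any step failed (all traffic goes back to the old version).
    """
    current = 0
    for percent in steps:
        # In a real rollout: shift `percent` of traffic to the new version,
        # wait, then evaluate error rate and latency here.
        if not is_healthy(percent):
            return 0  # roll back: old version takes all traffic again
        current = percent
    return current


# Hypothetical health check: pretend errors spike once the canary
# receives half of the traffic.
def fake_check(percent: int) -> bool:
    return percent < 50


print(run_canary([5, 10, 25, 50, 100], fake_check))      # 0
print(run_canary([5, 10, 25, 50, 100], lambda p: True))  # 100
```

Tools such as Argo Rollouts automate this loop; the sketch only shows why good health signals matter at every step.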

What is your Git branching strategy?

Your Answer

In my projects I follow a simplified GitFlow: • main: Production-ready code. Protected branch — no direct pushes. • develop: Integration branch. Features merge here first. • feature/<name>: Short-lived feature branches. PR → develop. • hotfix/<name>: Emergency fixes directly from main. Merged back to both main and develop. For environments: Tags (v1.0.0, v1.1.0) on main trigger production deployments. Develop branch pushes trigger staging. Common alternative — Trunk-Based Development: • Single main branch. Short-lived feature branches (< 1 day). Frequent small merges. • Feature flags for incomplete features. Used by high-velocity teams (Google, Facebook). • Better for CI/CD because integration happens constantly. When asked which I prefer, I'd say it depends on team size — GitFlow for larger teams with release schedules, Trunk-Based for small, fast-moving teams.

How to deploy Jenkins on Kubernetes? How to scale Jenkins nodes?

Your Answer

Deploy Jenkins on Kubernetes: Best way: Use the official Jenkins Helm chart.

helm repo add jenkins https://charts.jenkins.io
helm repo update
helm install jenkins jenkins/jenkins \
  --namespace jenkins \
  --create-namespace \
  --set controller.serviceType=LoadBalancer

Jenkins runs as a StatefulSet (to preserve job history and plugins on restart), with a PVC attached for data persistence. Scaling Jenkins with Kubernetes: The Jenkins Kubernetes Plugin is the key. Instead of static agents (worker nodes), it spins up Kubernetes pods as Jenkins agents on demand. How it works: • The Jenkins controller (formerly called the master) defines Pod Templates: what container image to use for agents (java, node, python). • When a build job triggers, Jenkins dynamically creates a pod in Kubernetes for that specific build. • When the build completes, the pod is destroyed. • This scales as far as the cluster allows: Kubernetes handles pod scheduling and the Cluster Autoscaler adds nodes as needed, up to your node-group limits and quotas. Benefits: No idle agents wasting compute. Each build gets a fresh, clean environment. Multiple builds run in parallel as separate pods.

📊Monitoring & Observability

Difference between CloudWatch and CloudTrail.

🔥 Hot

Your Answer

CloudWatch: • Monitoring and observability service. • Collects metrics (CPU, network, disk I/O), logs, and events from AWS resources and applications. Memory and disk-space metrics on EC2 are not collected by default; they need the CloudWatch agent. • You set alarms on metrics (CPU > 80% → trigger alert/action). • Log Groups store application and service logs. • CloudWatch Dashboards for visualization. • Answer: "What is my application DOING right now?" CloudTrail: • Audit and governance service. • Records API activity in your AWS account: who did what, when, from where. Management events are logged by default; data events (such as S3 object reads) must be enabled separately. • "Someone terminated my EC2" → CloudTrail tells you who (IAM user/role), when, from which IP. • Useful for security auditing, compliance, and incident investigation. • Answer: "WHO made changes to my AWS infrastructure?" Simple analogy: CloudWatch = Security cameras watching your building's operations. CloudTrail = A logbook of every person who entered, what they touched, and when.

What is Grafana? What data sources can it use? How does Prometheus collect metrics?

🔥 Hot

Your Answer

Grafana: Open-source visualization and analytics platform. It doesn't store metrics itself — it connects to data sources and creates dashboards. Data sources Grafana supports: • Prometheus (most common for metrics) • Loki (logs, by Grafana Labs) • Elasticsearch / OpenSearch • InfluxDB, Graphite • AWS CloudWatch, Google Cloud Monitoring • MySQL, PostgreSQL (for business metrics) • Tempo (distributed tracing) How Prometheus collects metrics: Prometheus uses a PULL model — it scrapes HTTP endpoints. 1. Applications expose a /metrics endpoint (text format). 2. Prometheus is configured with scrape_configs — lists of targets to scrape. 3. Prometheus periodically hits those endpoints and pulls metrics. 4. Stores in its time-series database (TSDB). 5. Grafana queries Prometheus using PromQL. For things you can't scrape (short-lived jobs, batch jobs): • Pushgateway: The job pushes metrics to Pushgateway, Prometheus scrapes Pushgateway. Kubernetes integration: • kube-state-metrics: Exposes Kubernetes object state (pod status, deployment replicas). • node-exporter: System metrics from each node (CPU, disk, network). • Prometheus scrapes both automatically via service discovery.
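To make the "/metrics endpoint (text format)" step concrete, here is a minimal sketch of what an application returns when Prometheus scrapes it. render_metric is a hypothetical helper written for illustration; it does not escape label values, and a real service would normally use the official prometheus_client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027)],
))
```

Prometheus fetches this text over HTTP on every scrape interval and stores each sample with a timestamp in its TSDB.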

ELK stack — experience, ELK vs EFK, Filebeat issues, ideal log format for Elasticsearch.

⚠️ Tricky

Your Answer

My ELK Experience: At CYfuture, I worked with Nginx access/error logs during a security incident investigation. I analyzed log patterns by reading logs directly, not through a managed ELK stack. My ELK knowledge is primarily from study and personal lab work, not production management. I'm transparent about that. ELK vs EFK: ELK = Elasticsearch + Logstash + Kibana • Logstash: Collects, processes (filters, transforms), and ships logs. Heavyweight because it runs on the JVM. EFK = Elasticsearch + Fluentd (or Fluent Bit) + Kibana • Fluentd/Fluent Bit: Lightweight log collectors with much lower resource usage. • Fluent Bit is even lighter (written in C). Well suited to containers/Kubernetes as a DaemonSet. In Kubernetes: EFK is more common because Fluent Bit as a DaemonSet collects container logs from each node efficiently. Common Filebeat issues: • File descriptor limits: Filebeat opens many files. You may need to increase ulimits. • Harvester getting behind: If logs rotate faster than Filebeat reads, you lose logs. Monitor harvester lag. • Registry file issues: Filebeat tracks file positions in a registry. If it gets corrupted, Filebeat resends all logs (duplicates). • Memory usage: Filebeat's memory can grow if it's behind on processing. Ideal log format for Elasticsearch: • JSON structured logs, where each field is a separate key. • Always include: timestamp (ISO 8601), level (INFO/ERROR/WARN), service name, message, trace_id. • Avoid unstructured strings because they can't be queried efficiently. • Which component adds metadata: usually the log collector. Fluentd uses the kubernetes_metadata filter plugin, Fluent Bit has a built-in kubernetes filter, and Filebeat has the add_kubernetes_metadata processor; each enriches records with pod name, namespace, and labels.
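A minimal sketch of emitting logs in the JSON shape recommended above, using only the Python standard library. JsonFormatter and the "checkout" service name are illustrative assumptions; the trace_id is passed through the logging module's extra argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is optional; None when the caller did not supply it
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted", extra={"trace_id": "abc123"})
```

Because every field is a separate key, Elasticsearch can index level, service, and trace_id directly without grok-style parsing.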

🐧Linux / Git Fundamentals

Unix commands to check memory, CPU, disk, processes.

🔥 Hot

Your Answer

Memory: • free -h (human-readable, shows used/available/cached/swap) • cat /proc/meminfo (detailed memory info) • vmstat -s CPU: • top or htop (real-time, process-level CPU usage) • mpstat -P ALL (per-core CPU stats, from the sysstat package) • uptime (load averages: 1, 5, 15 minute) • cat /proc/cpuinfo Disk: • df -h (filesystem disk space) • du -sh /path (directory size) • iostat (disk I/O stats) • lsblk (block devices) Processes: • ps aux (all processes) • ps aux | grep nginx • pgrep -a nginx • lsof -p <pid> (files opened by process) Network: • ss -tuln (listening ports, replaces netstat) • ip addr (network interfaces) • iftop (real-time bandwidth per connection) or nethogs (real-time bandwidth per process) Logs: • journalctl -u nginx -f (follow nginx service logs) • journalctl -xe (jump to the end of the journal with explanatory text, useful right after a failure) • tail -f /var/log/nginx/error.log Note: These are tools I used daily at CYfuture to diagnose incidents, so I can speak to them from real experience, not just textbook knowledge.

Difference between git clone and git pull. Steps to fix a wrong commit.

Your Answer

git clone: • Creates a new local copy of a remote repository from scratch. • Used when you don't have the repo locally yet. • Downloads all history, branches, tags. git pull: • Updates an existing local repo with changes from remote. • Equivalent to git fetch + git merge (or rebase, depending on config). • Used when you already have the repo and want the latest changes. Fixing a wrong commit: 1. Fix last commit message (not pushed yet): git commit --amend -m "correct message" 2. Undo last commit, keep changes staged: git reset --soft HEAD~1 3. Undo last commit, keep changes unstaged: git reset --mixed HEAD~1 4. Undo last commit, discard all changes (DANGEROUS): git reset --hard HEAD~1 5. Revert a commit (creates a new commit that undoes — safe for shared branches): git revert <commit-hash> 6. Already pushed to shared branch? Use revert, not reset. Reset rewrites history — problematic for others who pulled. Revert adds a new commit — safe. 7. Check changes in a file: git diff HEAD~1 HEAD -- <filename> git log --follow -p <filename>

How do you transfer a file to a remote server safely? How do you verify file integrity?

Your Answer

Safe file transfer: SCP (Secure Copy), which uses SSH: scp -i key.pem file.txt ubuntu@1.2.3.4:/home/ubuntu/ SFTP, an interactive SSH-based file transfer: sftp -i key.pem ubuntu@1.2.3.4 rsync, better for large files or syncing directories (only sends differences, and can resume interrupted transfers with --partial): rsync -avz --partial -e "ssh -i key.pem" ./local-dir ubuntu@1.2.3.4:/remote-dir/ All of these use SSH under the hood, so data is encrypted in transit. Verifying file integrity: Generate a checksum before transfer: sha256sum file.tar.gz → gives hash After transfer, on the remote server: sha256sum file.tar.gz → compare hashes If the hashes match → the file is intact. If they differ → the file was corrupted or altered. Also: md5sum (faster but cryptographically broken, so only use it to detect accidental corruption), sha512sum (longer digest). In practice at CYfuture: When uploading config files or scripts to servers, I'd always generate and verify SHA256 checksums to ensure nothing was corrupted during transfer.
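The same check can be scripted when sha256sum is not convenient. A minimal sketch using Python's hashlib; sha256_of_file and verify are hypothetical helper names, and the temporary file only stands in for a transferred file.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never load fully into memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare a file's digest with the hash recorded before transfer."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo with a throwaway file standing in for the transferred artifact
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

recorded = sha256_of_file(demo_path)   # hash taken "before transfer"
print(verify(demo_path, recorded))     # True
os.remove(demo_path)
```

The output matches what sha256sum prints for the same file, so a hash generated on one side with the CLI can be verified on the other side with this script.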

EC2 application not accessible — what will you check?

🔥 Hot

Your Answer

Systematic troubleshooting approach: Layer 1: Is the instance running? • AWS Console: Check instance state. Running? Stopped? Check system and instance status checks. Layer 2: Network/Firewall: • Security Group: Does the inbound rule allow traffic on the app port (80/443/8080)? • NACL: Are inbound AND outbound rules correct? (NACLs are stateless, so you need both.) • Is the instance in the right subnet (public if directly accessible), and does it have a public or Elastic IP? • Does the VPC have an Internet Gateway attached? • Route table: Does the subnet's route table send 0.0.0.0/0 to the Internet Gateway? Layer 3: Is the application running? • SSH into EC2: ssh -i key.pem ubuntu@ip • systemctl status nginx (or your app service) • Is it listening on the right port? ss -tuln | grep :80 Layer 4: Application level: • Check app logs: journalctl -u app -f • curl localhost:8080 (does it respond locally?) • Disk full? df -h (a full disk can cause the app to fail) • Memory? free -h Layer 5: If behind a Load Balancer: • Check ALB target group health checks. Are targets healthy? • Check the ALB security group. Does it allow traffic from the internet? • Does the EC2 security group allow traffic from the ALB security group?
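A quick way to separate the network layers from the application layer is to probe the port from outside the instance. The sketch below is a rough heuristic with a hypothetical probe_port helper: a timeout typically points at a Security Group, NACL, or routing problem (packets silently dropped), while a refused connection typically means the host is reachable but nothing is listening on that port.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try a TCP connection and classify the result.

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on this port
        return "refused"
    except socket.timeout:
        # No answer at all: packets were likely dropped along the way
        return "timeout"
    except OSError:
        # DNS failure, unreachable network, and similar problems
        return "error"


# Hypothetical target; replace with your instance's address and app port
print(probe_port("127.0.0.1", 8080, timeout=1.0))
```

Running it from your laptop and then from another instance in the same VPC helps narrow down whether the block is at the internet edge or inside the VPC.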

🐍Python / Coding Questions

Write a Python script to parse logs — extract JSON, separate by app, output message + timestamp.

🔥 Hot

Your Answer

This is a coding round question. Here's the full answer:

import json
import os
from collections import defaultdict

def parse_logs(filepath):
    """
    Reads a log file where each line is a JSON object.
    Groups logs by 'application' field.
    Outputs message and timestamp for each log.
    """
    # Group logs by application name
    logs_by_app = defaultdict(list)

    # open(filepath, 'r') opens the file in read mode
    # Returns a file object f
    with open(filepath, 'r') as f:
        # 'for line in f' iterates line by line
        # Memory efficient — doesn't load whole file at once
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
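            try:
                # Parse each line as JSON
                log_entry = json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping non-JSON line: {line}")
                continue

            # Valid JSON that is not an object (a bare number, a list)
            # has no .get(), so skip it as well
            if not isinstance(log_entry, dict):
                print(f"Skipping non-object JSON line: {line}")
                continue

            app_name = log_entry.get('application', 'unknown')
            message  = log_entry.get('message', '')
            timestamp = log_entry.get('timestamp', '')

            logs_by_app[app_name].append({
                'timestamp': timestamp,
                'message': message
            })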

    return logs_by_app

def main():
    filepath = 'app.log'

    if not os.path.exists(filepath):
        print(f"File not found: {filepath}")
        return

    logs_by_app = parse_logs(filepath)

    for app, entries in logs_by_app.items():
        print(f"\n=== Application: {app} ===")
        for entry in entries:
            print(f"[{entry['timestamp']}] {entry['message']}")

if __name__ == '__main__':
    main()

Explaining the concepts: • import json: Standard library module to parse/serialize JSON strings. • import os: OS-level operations — file existence checks, path manipulation, env vars. • open(file, 'r'): Opens file in read mode. Returns a file object. 'w' = write, 'a' = append, 'rb' = read binary. • for line in f: Iterates over the file object line by line. Memory efficient — doesn't load the whole file. • Alternative to read line by line: f.readlines() (loads ALL lines into a list — uses more memory) or f.read().splitlines(). • defaultdict(list): A dict that auto-initializes missing keys with an empty list. Health check script:

import subprocess
import sys

def check_service(service_name):
    result = subprocess.run(
        ['systemctl', 'is-active', service_name],
        capture_output=True, text=True
    )
    status = result.stdout.strip()
    if status == 'active':
        print(f"[OK] {service_name} is running")
        return True
    else:
        print(f"[FAIL] {service_name} is {status}")
        return False

services = ['nginx', 'postgresql', 'redis']
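# Build a list first so every service is checked and reported;
# all() on a generator would stop at the first failure.
results = [check_service(s) for s in services]
all_healthy = all(results)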
sys.exit(0 if all_healthy else 1)

Dictionary inversion. Palindrome check. Sorting without built-in.

Your Answer

Dictionary inversion (swap keys and values):

original = {'a': 1, 'b': 2, 'c': 3}
inverted = {v: k for k, v in original.items()}
# {1: 'a', 2: 'b', 3: 'c'}

Palindrome check:

def is_palindrome(s):
    s = s.lower().replace(' ', '')
    return s == s[::-1]

print(is_palindrome("racecar"))  # True
print(is_palindrome("hello"))    # False

Sorting without built-in (Bubble Sort):

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

nums = [64, 34, 25, 12, 22]
print(bubble_sort(nums))  # [12, 22, 25, 34, 64]

🏛️Architecture & Scenario Questions

User says application is slow. What do you check?

🔥 Hot

Your Answer

I approach this methodically using the USE method (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration): Layer 1 — Infrastructure: • CPU: Is it spiking? top / CloudWatch metrics. • Memory: High usage causing swapping? free -h / CloudWatch. • Disk I/O: Is the disk saturated? iostat. Slow DB queries can cause this. • Network: High latency? Packet drops? iftop / CloudWatch. Layer 2 — Application: • Check application logs for error spikes, slow query warnings, timeout errors. • Are there more requests than usual? (Traffic spike?) Layer 3 — Database: • Slow queries? Check slow query log (MySQL) or pg_stat_statements (PostgreSQL). • Connection pool exhausted? Too many connections? • Missing indexes? EXPLAIN ANALYZE on slow queries. Layer 4 — Network/External: • DNS resolution slow? External API call slow? • CDN caching issues? Static assets not cached? Layer 5 — Application-level profiling: • APM tools (AWS X-Ray, New Relic, Datadog APM) trace individual requests. • Identify which service/function is slow. Layer 6 — Recent changes: • Was there a recent deployment? Could be a regression. • git log — what changed recently? My response to interviewer: "I'd start by checking infrastructure metrics — CPU, memory, disk — then application and database logs, then trace a specific slow request end-to-end to find the bottleneck."

Horizontal vs vertical scaling — how do you decide?

🔥 Hot

Your Answer

Vertical Scaling (Scale Up): • Increase the size of the existing server: more CPU, RAM. • Simple, with no architecture changes needed. • Has a limit, since you can only go as big as the largest instance type. • Downtime usually required (instance resize). • Good for: Stateful applications, databases, when horizontal scaling is complex. Horizontal Scaling (Scale Out): • Add more instances/pods. Distribute load. • No hard ceiling on instance count, though in practice it is limited by shared state (databases, locks), service quotas, and cost. • Requires a load balancer and stateless app design. • More resilient, with no single point of failure. • Good for: Stateless microservices, web servers, APIs. Decision framework: • Is the app stateless? → Horizontal first. • Is it a database? → Vertical first (read replicas for reads = horizontal for reads). • Is the bottleneck CPU-bound (computation)? → Horizontal (more instances doing parallel work). • Is it memory-bound (caching, JVM heap)? → Sometimes vertical is the faster fix, then optimize. • Cost: Horizontal with Spot instances is often cheaper than one giant instance. • Production best practice: Both. Right-size instances (vertical) + horizontal auto-scaling. In my project: Kubernetes HPA handles horizontal scaling automatically based on CPU/memory. At node level, Karpenter/Cluster Autoscaler handles adding/removing EC2 nodes.
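How the HPA picks a replica count is worth being able to explain. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies only that rule plus min/max bounds; the function name and default bounds are illustrative, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA calculation: scale in proportion to metric pressure."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the bounds configured on the autoscaler
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target -> scale out
print(desired_replicas(4, 90, 60))   # 6
# 5 pods averaging 30% CPU against a 60% target -> scale in
print(desired_replicas(5, 30, 60))   # 3
```

The second example shows why the result is rounded up: 2.5 replicas becomes 3, so the HPA errs on the side of capacity.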

How do you introduce Spot instances in production? Cost optimization strategies.

⚠️ Tricky

Your Answer

Spot instances are up to 90% cheaper than on-demand but can be interrupted with 2-minute notice. Strategy for safe Spot usage in production: 1. Mixed instance policy: Use Auto Scaling Groups with a mix — On-Demand for base capacity (70%), Spot for burstable capacity (30%). 2. Fault tolerance: Ensure your app handles interruptions gracefully. Spot interruption notices come as EC2 instance metadata — you can build handlers to drain the node before termination. 3. In Kubernetes (EKS): Use multiple Spot instance types and sizes — if one type is interrupted/unavailable, Karpenter picks another. Combine with pod disruption budgets so Kubernetes drains pods gracefully. 4. Stateless workloads only: Never run stateful databases on Spot. Use Spot for batch processing, CI/CD agents, stateless microservices, ML training. 5. Save on base: Use Reserved Instances or Savings Plans for predictable baseline load (1–3 year commitment = 40–70% savings). In my project: I used multi-architecture builds (amd64 + arm64) targeting both x86 and Graviton Spot instances — this maximized instance type diversity and reduced interruption probability while cutting compute costs by ~20%.
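As a rough illustration of an interruption handler, the sketch below parses a Spot interruption notice and works out how much time is left to drain. It assumes the notice is the small JSON document served by the instance metadata path spot/instance-action (an "action" plus a UTC "time"); fetching it over HTTP, including IMDSv2 token handling, is deliberately left out, and seconds_until_interruption is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now=None):
    """Return seconds remaining before the instance is reclaimed."""
    notice = json.loads(notice_json)
    when = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never report a negative budget if the deadline already passed
    return max(0.0, (when - now).total_seconds())


# Sample notice in the assumed format, evaluated two minutes before the deadline
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
fixed_now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, fixed_now))  # 120.0
```

A handler would use that budget to cordon and drain the node, or to deregister the instance from its load balancer, before termination.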

Prevent developers from leaving infrastructure running. Cost governance.

Your Answer

Multiple approaches: 1. AWS Budgets + Alerts: Set budget thresholds. Alert the team lead or auto-respond (Lambda to stop instances) when budget exceeds X%. 2. Resource Tagging Policy + AWS Config: • Enforce mandatory tags (Environment, Owner, CostCenter) via AWS Config rules or Service Control Policies. • Untagged resources are automatically flagged or denied creation. 3. IAM + SCPs (Service Control Policies): • In AWS Organizations, SCPs can restrict what instance types or regions devs can use. • Deny creating resources above a certain size in dev accounts. 4. Auto-shutdown schedules: • AWS Instance Scheduler: Automatically stop dev EC2 instances at 7pm and start at 9am. • Lambda + EventBridge: Tag-based scheduler. 5. Terraform + IaC governance: • All infrastructure must be via Terraform with PR reviews. • Cost estimation via Infracost in the CI pipeline — shows cost delta for every PR. • Prevents manual console deployments. 6. AWS Trusted Advisor / Cost Explorer: Weekly reports sent to teams showing top cost offenders. 7. Sandbox account expiry: Dev environments auto-destroy after N days using Terraform destroy in CI.

🔐Security & Secrets

What happens if a private key is exposed? How do you respond?

🔥 Hot

Your Answer

This is a security incident, so treat it as one: Immediate response (minutes): 1. Revoke/rotate the key immediately. Don't wait, even if you think it wasn't seen. Revoke the AWS access key in the IAM console. If it is an SSH key, remove it from authorized_keys on all servers. If it is a TLS private key, revoke the certificate with the CA. 2. Check CloudTrail (AWS) for any API calls made using that key and look for unusual activity (IAM changes, resource creation, data access). 3. Assume breach if any suspicious activity is found, and escalate. Investigation: 4. Determine scope: Where was the key used? What did it have access to? What was potentially accessed? 5. Check if the key was committed to Git: git log --all -S'AKIA' searches the content of every change (note that --grep only searches commit messages), or use tools like git-secrets or TruffleHog to scan history. 6. If in git history: The key must still be revoked even after deletion from history. Assume it was scraped by bots (public GitHub repositories are scanned by credential scrapers within minutes). Prevention going forward: • Use IAM roles instead of long-lived keys (for EC2/ECS/Lambda). • Use AWS Secrets Manager or HashiCorp Vault for all secrets. • Never hardcode credentials in code. • Set up git-secrets or pre-commit hooks to prevent committing secrets. • AWS participates in GitHub's secret scanning partner program, so AWS keys pushed to public repositories can be detected and flagged automatically. If SSH private key exposed: • Generate a new key pair. • Add the new public key to all authorized servers. • Remove the old public key from all authorized_keys files. • Audit SSH access logs for unauthorized logins.
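The idea behind a pre-commit secret scan can be sketched with two regular expressions. This is a simplified illustration, not a replacement for git-secrets or TruffleHog: the patterns cover only the common AKIA/ASIA access key ID prefix and PEM private key headers, scan_text is a hypothetical helper, and the sample uses the placeholder key from AWS documentation.

```python
import re

# Simplified patterns; real scanners ship many more rules plus entropy checks
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_text(text):
    """Return (line_number, kind) for every line that looks like a secret."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, kind))
    return findings


sample = (
    "region = us-east-1\n"
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
)
print(scan_text(sample))
# [(2, 'aws_access_key_id'), (3, 'private_key_header')]
```

Wired into a pre-commit hook, a non-empty result would block the commit before the secret ever reaches the remote.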

How do you manage secrets in automation? How do you rotate credentials?

Your Answer

Managing secrets in automation: 1. AWS Secrets Manager: Store secrets (DB passwords, API keys). Fetch at runtime via SDK or IAM role. Automatic rotation support. No secret in code or environment variables. 2. Kubernetes Secrets: Store as base64-encoded Kubernetes Secret objects. Mount as files or env vars. Better: Use External Secrets Operator — sync secrets from Secrets Manager/Vault into Kubernetes Secrets. 3. HashiCorp Vault: Dynamic secrets — instead of static passwords, Vault generates short-lived credentials on demand. App requests a DB credential → Vault generates one → credential auto-expires. 4. GitHub Actions / CI: Use GitHub Secrets — masked in logs. Access via ${{ secrets.MY_SECRET }}. Never print secrets in pipelines. 5. Never: Hardcode in code, commit to git, store in .env files in repo. Credential rotation: Automatic rotation (recommended): • AWS Secrets Manager supports automatic rotation with Lambda functions for RDS, Redshift, DocumentDB. • Define rotation schedule (e.g., every 30 days). Secrets Manager updates the password in both the secret and the database. Manual rotation process: 1. Generate new credential. 2. Update the secret in Secrets Manager/Vault. 3. Ensure apps pick up the new secret (restart or dynamic refresh). 4. Revoke old credential only after confirming new one works. 5. Verify no service disruption in logs. For SSH keys and TLS certs: • cert-manager in Kubernetes auto-renews Let's Encrypt certs. • Vault PKI auto-issues and expires short-lived certs.

IAM — Explicit Deny vs Allow. Permission evaluation flow.

⚠️ Tricky

Your Answer

AWS IAM Permission Evaluation Order: 1. Explicit Deny ALWAYS wins. If any applicable policy (identity policy, resource policy, SCP, permission boundary, or session policy) has an explicit Deny for the action, the request is denied, regardless of any Allow statements anywhere. 2. If no Deny, check for Allow. Permissions are granted only by identity policies (user/group/role) or resource policies. SCPs (if in AWS Organizations), permission boundaries, and session policies never grant anything; they only cap what can be allowed, so the action must also be permitted by each of them where present. 3. Implicit Deny by default. If no explicit Allow exists, the request is denied. AWS is deny-by-default. Permission Boundary scenario: • A permission boundary is an IAM managed policy that caps the maximum permissions a role can have. • Even if the role's identity policy allows s3:*, if the permission boundary only allows s3:GetObject, the effective permission is only s3:GetObject. • Permission = intersection of identity policy AND permission boundary. Evaluation flow (simplified, single account): Explicit Deny? → DENY. SCP allows? → continue (else DENY). Permission boundary allows? → continue (else DENY). Identity policy or resource policy allows? → ALLOW. Nothing allows? → DENY (implicit). Use case for Explicit Deny: • Deny all actions outside a specific region regardless of what any policy allows. • Deny deleting S3 objects from the production bucket to prevent accidents.
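The evaluation order can be sketched as a small function. This is a deliberately simplified model for identity-based policies in a single account: policies are plain sets of exact action names (no wildcards, conditions, resource policies, or session policies), and is_allowed is a hypothetical helper, not an AWS API.

```python
def is_allowed(action, identity_allow, boundary_allow=None,
               scp_allow=None, explicit_deny=frozenset()):
    """Simplified IAM decision: deny wins, guardrails cap, identity grants."""
    # 1. An explicit Deny anywhere ends the evaluation
    if action in explicit_deny:
        return False
    # 2. SCPs and permission boundaries only cap permissions;
    #    when present, the action must appear in each of them
    if scp_allow is not None and action not in scp_allow:
        return False
    if boundary_allow is not None and action not in boundary_allow:
        return False
    # 3. Only the identity policy actually grants; otherwise implicit deny
    return action in identity_allow


identity = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject"}
boundary = {"s3:GetObject"}

print(is_allowed("s3:GetObject", identity, boundary_allow=boundary))  # True
print(is_allowed("s3:PutObject", identity, boundary_allow=boundary))  # False
print(is_allowed("s3:GetObject", identity, explicit_deny={"s3:GetObject"}))  # False
```

The second call is the permission boundary scenario: the identity policy allows the action, but the boundary does not, so the effective permission is the intersection.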

💼Behavioral / Managerial

Why did you leave your previous company? Why did you leave without an offer?

🔥 Hot

Your Answer

My internships were short-term positions with defined end dates — the roles at Xloud (Oct–Dec 2024) and CYfuture (Dec 2025–Feb 2026) were internships, not permanent positions. So I didn't "leave" in the traditional sense — they concluded as planned. Between roles, I focused on building projects (the 8-tier Kubernetes migration, technical writing on LinkedIn and Medium) to keep growing. I didn't rush to accept just any offer — I wanted to make sure my first full-time role was the right fit where I could genuinely contribute and grow. I have the time to be selective now. I'm interviewing with companies where I see real growth potential, and I'm focused on finding the right long-term opportunity, not the fastest one.

Tip: Never say anything negative about previous companies. Keep it professional and forward-looking.

What is failure to you? What are you afraid of?

Your Answer

Failure to me: Failure isn't failing at a task — that's just learning. Real failure would be failing to try, or failing to learn from a mistake. If I break something in a dev environment trying something new — that's a lesson. If I break it twice the same way — that's a failure of not learning. What I'm afraid of: Professionally, I'm afraid of stagnation — being in a role for two years and not being meaningfully better than when I started. I combat that by always having a learning goal — right now I'm focused on deepening my Kubernetes and Terraform skills. Personally, I want to give an honest answer: I'm early in my career and one of my concerns is whether I can perform well under production pressure. But I think that's natural and it motivates me to be well-prepared and learn from every incident.

Tip: Showing self-awareness about early career pressure is actually a green flag for most hiring managers — it shows maturity.

Are you comfortable with weekend production work? Can you work in UK shift? What if stuck at night with team offline?

⚠️ Tricky

Your Answer

Weekend production work: Yes, I understand that infrastructure doesn't observe business hours and production incidents happen when they happen. I'm comfortable with being on-call, especially as I build experience. I'd want to understand the on-call rotation so there's predictability, but I'm not averse to weekend work. UK shift: I'm open to discussing it. [If you are genuinely comfortable with it, say yes confidently. If not, say "I'm open to a hybrid — say a 1pm–10pm IST overlap with UK morning."] Stuck at night with team offline: My approach: 1. Check all available runbooks and documentation first. 2. Assess: Is this a complete outage or degraded performance? Can it wait till morning? 3. If critical: Attempt to roll back to last known good state (git revert, previous deployment, load balancer switch). 4. Escalate if needed — I'd know who the senior on-call contact is and reach them if the incident is severe. 5. Never guess in production — if I'm unsure about a fix, I'd prefer to maintain the rollback state and wait for a senior. 6. Document everything I'm doing in real time in an incident channel. Being honest about asking for help is not weakness — it's professionalism.

Have you experienced favoritism / bias at workplace? Integrity question.

Your Answer

I'm early in my career so my workplace exposure has been limited to internships, but I haven't experienced overt favoritism. If I ever did, I'd handle it professionally: I'd focus on what I can control — my output, my reliability, and the quality of my work. If favoritism affected a team's performance or morale collectively, I'd raise it through proper channels — my manager or HR — with specific, factual examples rather than complaints. On integrity: I believe in being transparent, especially in technical roles. If I don't know something, I say so rather than guessing and risking a production issue. If I make a mistake, I own it, document it, and share what I learned so the team doesn't repeat it. Covering up errors in DevOps/ops roles can snowball into much bigger problems.

Why are you still interviewing after receiving offers? Which AI tools do you use?

▾

Your Answer

Still interviewing with offers: I believe in making informed decisions. My first full-time role sets the foundation for my career — I want to make sure I join a team where I'll grow, where the tech stack is relevant, and where the culture is healthy. Having options means I'm not choosing out of desperation, but out of genuine fit. I'm close to making a decision. [Important: Say this only if true. If you have no offers, don't pretend.] AI tools I use: I use Claude and ChatGPT regularly for learning — especially to understand concepts from multiple angles, debug configurations, and review my technical writing. I use them as study partners, not as crutches. For DevOps specifically: GitHub Copilot for scripting, Perplexity for quick technical lookups. I believe AI tools make you faster at tasks you already understand — they're dangerous if you use them to do tasks you don't understand at all, because you won't catch their mistakes. Free or licensed: Primarily free tiers. I use Claude.ai and ChatGPT free. For professional work I'd be open to whichever tools the team uses.

How fast can you learn new technology? Would you travel back in time to change anything?

▾

Your Answer

Learning speed: I learn by doing. Give me a goal and I'll figure out the path. When I needed to deploy Kubernetes for my 8-tier project, I had limited K8s experience, so I broke it down: first understand the core components, then get a single pod running, then services, then StatefulSets. Within a few weeks I had a working multi-tier cluster with CI/CD and network policies. I also document what I learn on LinkedIn and Medium, because teaching forces you to truly understand something. If someone asks me a question I can't answer well, it tells me where my gap is. Travel back in time: Honestly, I'd start learning Linux and DevOps a year earlier than I did. I only got into it partway through my BCA, when I could have been building hands-on skills from day one. But every step of the journey, even the time I spent on pure theory, gave me the context to appreciate why things work the way they do. [Keep this light and honest. It's a culture-fit question checking for self-awareness, not a deep philosophical question.]

⭐Additional Questions to Expect

Describe a complex automation you built. Can you quantify the improvement?

Extra🔥 Hot

▾

Your Answer

The most complex automation I've built is the CI/CD pipeline for my 8-tier microservices Kubernetes project.
The problem: Manual Docker builds for 8 different services, manually pushing to the registry, manually applying Kubernetes manifests. Error-prone and time-consuming.
The automation: I built a GitHub Actions pipeline that:
1. Triggers on push to main.
2. Runs security scans (Docker Scout vulnerability scan + SBOM generation via Syft).
3. Builds multi-architecture Docker images (amd64 + arm64) using Docker Buildx + QEMU emulation, supporting both x86 and AWS Graviton instance types.
4. Tags images with the git SHA for traceability.
5. Pushes to the container registry.
6. Applies Kubernetes manifests using kubectl for a rolling deployment.
Quantified improvement:
• Deployment time: From ~25 minutes manual (build, push, apply for 8 services) to ~8 minutes automated, roughly a two-thirds reduction.
• Cost: Multi-arch images allowed using arm64 Graviton Spot instances, for a ~20% compute cost reduction.
• Security: Every deployment is scanned automatically, so no unscanned image reaches the cluster.
• Human error: Eliminated manual kubectl apply mistakes.
This isn't a massive enterprise automation, but it's something I designed, built, and can explain end-to-end.
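To back up the "quantified improvement" claim in an interview, it helps to have the arithmetic ready. The sketch below uses the approximate 25-minute and 8-minute figures from the answer; the `deploys_per_week` value is a hypothetical number chosen only for illustration, not something measured in the project.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage decrease going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


manual_minutes = 25     # approximate manual deployment time quoted above
automated_minutes = 8   # approximate pipeline run time quoted above

saved_per_deploy = manual_minutes - automated_minutes
reduction = percent_reduction(manual_minutes, automated_minutes)

deploys_per_week = 10   # hypothetical deployment frequency (assumption)
weekly_saving = saved_per_deploy * deploys_per_week

print(f"{saved_per_deploy} min saved per deploy ({reduction:.0f}% less time)")
print(f"{weekly_saving} min saved per week at {deploys_per_week} deploys")
```

Stating the result as both an absolute saving per deployment and a percentage tends to land better than either number alone.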

MongoDB vs RDS — internal difference between SQL vs NoSQL.

Extra

▾

Your Answer

RDS (Relational / SQL):
• Stores data in tables with a fixed schema (rows and columns).
• ACID compliant (Atomicity, Consistency, Isolation, Durability), which gives strong data integrity.
• SQL query language with powerful JOINs across tables.
• Best for: Structured data with clear relationships (e-commerce, banking, ERP).
• Scales vertically first; read replicas offload reads, but horizontal sharding of writes is complex.
• AWS RDS supports: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora (MySQL- and PostgreSQL-compatible).
MongoDB (Document / NoSQL):
• Stores data as JSON-like BSON documents. Flexible schema: documents in the same collection can have different fields, and schema validation is optional.
• Reads from the primary are strongly consistent by default; reads from secondaries can be stale (eventually consistent). Read and write concerns make this configurable.
• No relational JOINs by design: related data is usually embedded in the document. The $lookup aggregation stage provides a left outer join when one is needed.
• Best for: Flexible, rapidly changing schemas (content management, user profiles, IoT data).
• Horizontally scalable, with native sharding built in.
Internal difference:
• SQL engines: B-Tree indexes for fast lookups, a query planner that optimizes JOIN operations, transactions with rollback support, and foreign keys enforced by the database.
• MongoDB: Uses the WiredTiger storage engine with MVCC (Multi-Version Concurrency Control), a technique several relational engines also use. Indexes are B-tree based here too. Documents are stored as BSON (Binary JSON). Supports aggregation pipelines. No foreign key enforcement at DB level.
When to choose:
• Need complex queries with JOINs and strong consistency → SQL/RDS.
• Need flexible schema, rapid iteration, horizontal scale → MongoDB.
• Financial data → Usually relational. MongoDB has supported multi-document ACID transactions since version 4.0, but relational constraints and mature transaction handling make SQL the safer default.
Change Data Capture (CDC): A technique to track and capture every INSERT/UPDATE/DELETE in a database in near real time. Used for:
• Replicating data to data warehouses.
• Keeping microservices in sync.
• Tools: AWS DMS, Debezium (open source, works with Kafka).
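The contrast between a normalized relational model and an embedded document can be shown with nothing but the Python standard library. This is a minimal sketch: SQLite stands in for an RDS engine, a plain dictionary stands in for a MongoDB document, and the table names, customer name, and items are all made up for illustration.

```python
import json
import sqlite3

# Relational model: two tables linked by a key, combined at query time with a JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'monitor');
""")

rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()
print(rows)

# Atomicity: the second insert violates the primary key, so the whole
# transaction (including the first insert) is rolled back.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders VALUES (12, 1, 'mouse')")
        conn.execute("INSERT INTO orders VALUES (10, 1, 'duplicate id')")
except sqlite3.IntegrityError:
    pass
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)

# Document model: the orders are embedded in the customer document,
# so reading them needs no JOIN.
customer_doc = {
    "_id": 1,
    "name": "Asha",
    "orders": [{"item": "keyboard"}, {"item": "monitor"}],
}
items = [order["item"] for order in customer_doc["orders"]]
print(json.dumps(customer_doc))
```

The trade-off shows up on update: the relational version changes an order in one row, while the embedded version rewrites part of the customer document, which is fine when the data is usually read together.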

What is CloudFormation? CDK — What is a construct, stack? Why multiple CDK apps?

Extra

▾

Your Answer

CloudFormation: AWS's native IaC service. You define infrastructure in JSON or YAML templates, and CloudFormation manages the stack lifecycle: create, update, delete. If a template tries to create a resource that already exists outside the stack (for example, a bucket with the same name), the deployment fails with an "already exists" error; to manage that resource you need to import it into the stack first.
AWS CDK (Cloud Development Kit): IaC using general-purpose programming languages such as TypeScript, Python, Java, C#, and Go. CDK synthesizes CloudFormation templates under the hood, then deploys them.
Construct: The basic building block in CDK. A construct represents one or more CloudFormation resources.
• L1 Construct: Direct 1:1 mapping to a CloudFormation resource (CfnBucket).
• L2 Construct: Higher-level abstraction with sensible defaults (s3.Bucket) that handles permissions, encryption, etc.
• L3/Patterns: Opinionated combinations of multiple resources (aws-ecs-patterns.ApplicationLoadBalancedFargateService).
Stack: A Stack is a deployable unit in CDK. It maps 1:1 to a CloudFormation stack. One CDK App can have multiple stacks.
Why multiple CDK apps vs one? The reasons are the same ones that justify splitting an app into several stacks, taken one step further to separate codebases and pipelines:
• Separation of concerns: Networking stack, app stack, and data stack can be independently deployed and versioned.
• Different deployment frequencies: Infrastructure (VPC) rarely changes; the app stack deploys with every release.
• Independent IAM permissions: Different teams own different stacks.
• Blast radius: A bug in the app stack doesn't risk destroying your network infrastructure.
I have conceptual knowledge of CDK, but I haven't deployed production CDK apps. I'm transparent about that.

Design a highly available multi-region architecture.

Extra🔥 Hot

▾

Your Answer

For a multi-region HA architecture, I'd design around:
DNS Layer:
• Route 53 with Health Checks + Latency-Based or Failover routing.
• Primary region: us-east-1. Secondary: eu-west-1.
• Route 53 automatically routes to the secondary if the primary health check fails.
Frontend:
• S3 static hosting in each region + CloudFront (global CDN with edge locations).
• A single CloudFront distribution serves globally. The edge layer needs no regional failover, but the S3 origin does: use a CloudFront origin group with the second region's bucket as the failover origin.
Backend (Application):
• EKS/ECS cluster in both regions.
• ALB in each region.
• Auto Scaling across multiple AZs within each region.
• Active-Active (both serve traffic) or Active-Passive (failover only).
Database (hardest part):
• Aurora Global Database: Primary cluster in us-east-1, read-only secondary cluster in eu-west-1. Failover promotes the secondary to primary, typically in about a minute.
• Or DynamoDB Global Tables: Active-Active, multi-region, automatic replication.
• If using PostgreSQL on plain RDS: cross-region read replicas with manual promotion on failover.
Data Replication:
• S3 Cross-Region Replication for file storage.
• SQS/SNS/EventBridge for event-driven cross-region messaging.
Observability:
• Centralized logging (for example, CloudWatch Logs forwarded to one central account or region).
• Datadog or Grafana with multi-region data sources.
Trade-offs to mention: Multi-region adds significant complexity and cost. Most apps don't need it: a multi-AZ design within a single region can reach roughly 99.99% availability, which is enough for most use cases. Multi-region is for RTO < 1 min requirements or a global user base.
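The trade-off above is easier to argue with numbers. The sketch below converts an availability percentage into downtime per year and estimates what a second region adds. It assumes region failures are fully independent and failover is instant and flawless, which real systems never quite achieve, so treat the two-region figure as an optimistic upper bound. The 99.99% input is the figure from the answer, not a quoted SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year for a given availability (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments is enough."""
    return 1 - (1 - single) ** copies


single_region = 0.9999  # multi-AZ, single-region figure from the answer
two_regions = combined_availability(single_region, 2)

print(f"single region: {downtime_minutes_per_year(single_region):.2f} min/year")
print(f"two regions:   {downtime_minutes_per_year(two_regions) * 60:.2f} s/year")
```

Under these assumptions a single region allows a little under an hour of downtime per year, so the second region only pays for itself when that hour is genuinely unacceptable.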

How to manage dependent deployments in ArgoCD? ArgoCD usage.

Extra

▾

Your Answer

ArgoCD is a GitOps continuous delivery tool for Kubernetes. It watches a Git repository and automatically syncs the cluster state to match what's in Git.
Core concept: Git is the source of truth. You commit Kubernetes manifests/Helm charts to Git → ArgoCD detects the change → syncs the cluster.
Managing dependent deployments:
1. App of Apps pattern: A parent ArgoCD Application manages child Applications. Ordering between children comes from sync waves on the child Application resources; as far as I know, recent ArgoCD versions need the health check for the Application resource enabled before the parent will wait for a child to become healthy.
2. Sync Waves: Use ArgoCD sync waves via annotation. Resources with lower wave numbers deploy first, and resources without the annotation default to wave 0.
argocd.argoproj.io/sync-wave: "1" (deploys first of the three, e.g., DB)
argocd.argoproj.io/sync-wave: "2" (deploys second, e.g., backend)
argocd.argoproj.io/sync-wave: "3" (deploys last, e.g., frontend)
3. Sync Hooks: PreSync, Sync, PostSync, SyncFail hooks run jobs before/after deployment. Use PreSync to run DB migrations before app deployment.
4. Health checks: ArgoCD won't proceed to the next wave until the current wave's resources are healthy.
My experience: I've worked with Kubernetes directly with kubectl and GitHub Actions. ArgoCD I understand conceptually, and it's on my active learning list. I'm transparent that I haven't deployed ArgoCD in production.
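The sync-wave ordering described above can be illustrated with a small simulation. This is a simplified sketch, not ArgoCD's actual implementation: it only groups manifests by the `argocd.argoproj.io/sync-wave` annotation (defaulting to wave 0) and ignores sync phases, hooks, and the kind/name tie-breaking that ArgoCD also applies. The manifests and resource names are hypothetical.

```python
from itertools import groupby

WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def sync_wave(manifest: dict) -> int:
    """Read the sync-wave annotation; resources without it belong to wave 0."""
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def plan_waves(manifests: list) -> list:
    """Return [(wave, [resource names])] in the order the waves would apply."""
    ordered = sorted(manifests, key=sync_wave)  # stable sort keeps input order per wave
    return [
        (wave, [m["metadata"]["name"] for m in group])
        for wave, group in groupby(ordered, key=sync_wave)
    ]


# Hypothetical manifests, reduced to the fields this sketch needs.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "frontend", "annotations": {WAVE_ANNOTATION: "3"}}},
    {"kind": "StatefulSet", "metadata": {"name": "db", "annotations": {WAVE_ANNOTATION: "1"}}},
    {"kind": "Deployment", "metadata": {"name": "backend", "annotations": {WAVE_ANNOTATION: "2"}}},
    {"kind": "ConfigMap", "metadata": {"name": "app-config"}},
]

plan = plan_waves(manifests)
for wave, names in plan:
    print(f"wave {wave}: {names}")
```

Note that the un-annotated ConfigMap lands in wave 0 and is applied before the database, which is worth remembering when numbering waves from 1.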

S3 Lifecycle policy, Storage classes, IA meaning. Terraform null_resource.

Extra

▾

Your Answer

S3 Lifecycle Policy: Automatically transition objects between storage classes or delete them based on age. Example:
• Day 0–30: S3 Standard (frequent access)
• Day 30–90: S3 Standard-IA (infrequent access)
• Day 90–180: S3 Glacier Instant Retrieval
• Day 180+: S3 Glacier Deep Archive (cheapest, retrieval takes hours)
• Day 365: Delete
Storage Classes:
• S3 Standard: High availability, highest storage price per GB. For frequently accessed data.
• Standard-IA (Infrequent Access): Lower storage cost, but a per-GB retrieval fee and a 30-day minimum storage charge. Good for backups.
• One Zone-IA: Even cheaper, single AZ. Risk of data loss if the AZ fails.
• Glacier Instant Retrieval: Archive with millisecond access.
• Glacier Flexible Retrieval: Archive; retrieval takes from minutes (expedited) up to about 12 hours (bulk).
• Glacier Deep Archive: Lowest cost; retrieval within about 12 hours (standard) or 48 hours (bulk).
• S3 Intelligent-Tiering: AWS automatically moves objects between tiers based on access patterns.
IA = Infrequent Access. You pay less per GB stored, but pay per GB retrieved.
Terraform null_resource: A resource that does nothing by itself. It is used as a trigger mechanism or to run local-exec/remote-exec provisioners.

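resource "null_resource" "run_script" {
  # timestamp() changes on every run, so this resource is replaced
  # (and the provisioner re-runs) on every terraform apply.
  triggers = {
    always_run = timestamp()
  }

  # Runs on the machine executing Terraform, not on a remote host.
  provisioner "local-exec" {
    command = "bash ./setup.sh"
  }
}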

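Use it when you need to run a script or command as part of Terraform without managing an actual cloud resource. On Terraform 1.4 and later, the built-in terraform_data resource covers the same use case without requiring the null provider.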
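Returning to the S3 lifecycle example earlier in this answer, the schedule can be written down as data and checked. The dictionary below follows, to the best of my knowledge, the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but no AWS call is made here. The rule ID and the `storage_class_at` helper are hypothetical names introduced only for this sketch.

```python
import json
from typing import Optional

# Lifecycle schedule from the answer: Standard -> Standard-IA at day 30,
# Glacier Instant Retrieval at day 90, Deep Archive at day 180, delete at day 365.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-then-expire",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # empty prefix: apply to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def storage_class_at(age_days: int, rule: dict) -> Optional[str]:
    """Return the storage class an object of this age would be in, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
for age in (10, 30, 120, 200, 365):
    print(age, storage_class_at(age, rule))

print(json.dumps(lifecycle_configuration, indent=2))
```

Each tier in this schedule also stays long enough to satisfy the minimum storage durations of the colder classes, which is worth checking before shortening any of the intervals.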
