QA Engineer · Islamabad est. 2020

Zara Shafiq

I'm a QA engineer. Four years finding the bugs other people miss, in AI products, virtual collaboration software, and regulated healthcare.

Currently solo QA on a new Electron desktop product

4+ years in QA
2500+ test cases written
500+ defects tracked
130+ automation specs
CASE 01 · Healthcare · 2022

The billing edge case that almost shipped

Junior QA on a US-based EHR product (the therapy module: appointment scheduling, clinical notes, patient records, billing). HIPAA-regulated. Release bar of 99% defect-free.

I was working through a billing flow during a release cycle. Therapist completes a session, the system calculates the charge, the claim gets generated. For a standard session the numbers looked correct. When I switched to a less common session type (one that hit a specific billing-code boundary), the calculated charge didn't match what I worked out on paper.

I wrote out the expected charges across each scenario, ran them through the system one by one, and confirmed the mismatch was consistent and reproducible. Filed the bug with the math, the scenarios, and screenshots. Looped in the dev lead the same day.

Fixed before release. If it had shipped, therapists using that session type would have been under-billing for the work they'd done. In a regulated healthcare product, that's both a revenue and a compliance issue.

Never trust the UI when the math is doing the actual work. If a calculation looks right on screen, that's exactly when I check it by hand.
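A minimal sketch of that hand check as a table-driven comparison, not the code or data from the actual product. It assumes the charge formula has the shape (base rate × session duration multiplier) + applicable modifiers. The `BillingScenario` shape, the rates, and the `systemChargeCents` values are hypothetical placeholders.

```typescript
// Hypothetical scenario shape. Amounts are integer cents to avoid float drift.
interface BillingScenario {
  name: string;
  baseRateCents: number;
  durationMultiplier: number;
  modifiersCents: number[];
  systemChargeCents: number; // value read off the claim preview
}

// Expected charge = (base rate × duration multiplier) + applicable modifiers.
function expectedChargeCents(s: BillingScenario): number {
  const modifiers = s.modifiersCents.reduce((sum, m) => sum + m, 0);
  return Math.round(s.baseRateCents * s.durationMultiplier) + modifiers;
}

// Return every scenario where the system's charge differs from the hand calculation.
function findMismatches(scenarios: BillingScenario[]) {
  return scenarios
    .map((s) => ({
      name: s.name,
      expected: expectedChargeCents(s),
      actual: s.systemChargeCents,
    }))
    .filter((r) => r.expected !== r.actual);
}

// Made-up numbers for illustration only.
const scenarios: BillingScenario[] = [
  {
    name: "standard session",
    baseRateCents: 10000,
    durationMultiplier: 1,
    modifiersCents: [],
    systemChargeCents: 10000,
  },
  {
    name: "boundary session type",
    baseRateCents: 10000,
    durationMultiplier: 1.5,
    modifiersCents: [2500],
    systemChargeCents: 15000, // modifier missing from the system's total
  },
];

console.log(findMismatches(scenarios));
```

Keeping the expected values in a table like this also gives the bug report its evidence: the scenario, the hand-calculated charge, and what the system showed.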

ehr billing hipaa manual
CASE 02 · EdTech · 2023–24

Testing six products on real hardware, not just dev environments

Manual QA at an EdTech company building a connected ecosystem: an AI quiz generator, an AI transcription and translation service, a video conferencing tool, a lesson capture platform, an NFC-badge notification system, and a lesson builder. All of it ran across web, Android, and Android-based interactive touchscreen displays.

The default at the company was to test on dev environments and emulators, because real hardware was slower and more annoying to set up. I pushed to test on the actual touchscreen units instead, because the bugs that mattered were the ones that showed up only on real hardware: touch input latency under heavy load, NFC tap reliability with real badges, audio routing on the embedded Android, cross-device sync between a touchscreen and a teacher's mobile.

That choice meant I had to context-switch fast (a hardware bug in the morning, an AI accuracy check at lunch, a Stripe webhook validation in the afternoon) and learn each product's edges quickly. Caught wrong-question bugs in the AI quiz generator, transcription errors in the captioning tool, and a Stripe billing edge case where a subscription state transition wasn't firing the right webhook. The Stripe flow was where I learned subscription testing the hard way: every plan transition, every webhook event, every failed payment branch needed its own case.

Emulators tell you what the product does in theory. Real hardware tells you what it does in the room with the people who'll actually use it.

edtech hardware android ai-features stripe
CASE 03 · Real-time AI · 2026

Stress-testing a meeting room we couldn't afford to stress-test

Solo QA on a new Electron-based virtual office product. One of the meeting rooms routes through a production dialer for sales calls and customer demos. Customers were starting to ask how many concurrent dialer sessions the room could support. Internally, no one had a real answer.

Testing with real dialers at scale would have been expensive: real phone numbers, real per-minute call costs, real load on the dialer provider. My local machine also had hardware limits, so I couldn't just spin up more instances locally.

I worked through the test design in pieces. First, I substituted screen-share-with-sound for the real dialer. Both flow through the same LiveKit publisher pipeline under the hood, so the load on the backend would be equivalent without the per-call cost. Next, I moved off my local machine onto GitHub Actions runners to get past the hardware limits (36 runners in parallel). Then I asked the dev team to add a flag that allowed multiple instances of the app to run on the same machine, so a single runner could carry more than one user. Finally, I ran the test under realistic 3-window load (workspace + dialer + meeting room running simultaneously per instance), so the numbers reflected actual user behavior, not a synthetic baseline.

Verified 33 concurrent screen-share-with-sound publishers under production-equivalent load. The product team had a real number to give to customers, and a real understanding of where capacity actually sat (higher than the team had assumed). The setup is reusable, so when engineering ships changes to the real-time layer we can re-run the test instead of guessing.

Playwright test harness
  dispatches to 36 runners
    runner_01   3 x app instance   workspace + dialer + meeting room
    runner_02   3 x app instance   workspace + dialer + meeting room
    ...
    runner_36   3 x app instance   workspace + dialer + meeting room
  LiveKit publisher pipeline
    screen-share-with-sound = dialer (same backend pressure)
  → 33 concurrent publishers verified under realistic load

Power House · test architecture

A lot of capacity questions get answered with "we think it's around X." Going and measuring it takes more design work but produces a number you can actually defend.
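A minimal sketch of the "measure it" step: ramp the publisher count in increments and keep the highest level at which every publisher succeeded. This is an illustration, not the harness from this project. `Probe` is a hypothetical stand-in for starting app instances on the runners and counting successful publishers, and the stub capacity of 10 is invented.

```typescript
// Result of one load level: how many publishers were attempted and how many succeeded.
interface LevelResult {
  publishers: number;
  succeeded: number;
}

// Hypothetical probe. A real one would start app instances on the runners
// and count successful screen-share-with-sound publishers.
type Probe = (publishers: number) => Promise<number>;

// Ramp up in fixed steps and stop at the first level where any publisher fails.
async function findVerifiedCapacity(probe: Probe, step: number, max: number) {
  const levels: LevelResult[] = [];
  let verified = 0;
  for (let n = step; n <= max; n += step) {
    const succeeded = await probe(n);
    levels.push({ publishers: n, succeeded });
    if (succeeded < n) break; // first degraded level: stop ramping
    verified = n;
  }
  return { verified, levels };
}

// Stub probe for illustration: pretends the backend sustains at most 10 publishers.
const stubProbe: Probe = async (n) => Math.min(n, 10);

findVerifiedCapacity(stubProbe, 2, 20).then((r) => console.log(r));
```

Recording every level, not only the final number, is what makes the result defensible: the run shows where publishing stayed clean and where it first degraded.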

electron livekit load playwright github-actions

A short note, because most of my recent work has involved AI features and the testing practices are different.

Output Validation: Comparing AI-generated content against source material by hand, looking for the failure modes (hallucination, drift, mismatched answers, multiple valid answers marked as one). Flagging patterns back to the data science team.
Prompt Regression: When a model or prompt changes, the same inputs need to produce the same kind of outputs. Tracking output quality across prompt iterations and model updates.
Ranking Quality: For AI matching and ranking features (e.g. candidate ranking, search ranking), checking whether the top results are actually relevant or just plausible-looking.
UI Around AI: The AI is usually a third-party integration, but the product around it isn't. Most bugs live in the user-facing layer: input handling, editing, exporting, organizing AI-generated content.
What I Watch For: Plausible-looking output is the most dangerous kind. The AI can be confidently wrong in a way that the UI does nothing to flag. The job is to assume every "looks fine" output is suspect until verified.
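One common way to put a number on the ranking-quality check is precision@k: the share of the top k results that a human reviewer judged relevant. The sketch below is illustrative only. The item IDs and relevance judgments are hypothetical, and nothing here is tied to a specific product's ranking API.

```typescript
// precision@k: share of the top-k ranked items that appear in the human-judged relevant set.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) throw new RangeError("k must be positive");
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Hypothetical ranking output and reviewer judgments.
const rankedCandidates = ["c7", "c2", "c9", "c4", "c1"];
const judgedRelevant = new Set(["c2", "c4", "c5"]);

console.log(precisionAtK(rankedCandidates, judgedRelevant, 5));
```

A low score on results that all look plausible is exactly the case worth flagging: the ranking reads fine on screen while the reviewer's judgments disagree.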

Sanitized examples in the format I use day to day. The bugs are real, the company and product names are not.

Critical · BUG-2204 · EHR · Billing Module

Therapy session charge miscalculated on specific billing-code boundary

Steps
  1. Log in as therapist with billing permissions.
  2. Schedule and complete a session using session type [REDACTED-X] at the billing-code boundary [REDACTED-CODE].
  3. Navigate to Billing → Generate Claim.
  4. Observe the calculated charge in the claim preview.
Expected
Charge equals (base rate × session duration multiplier) + applicable modifiers, matching the published billing schedule.
Actual
Charge is under-calculated by the modifier amount on this specific session type. Reproducible across multiple test patients and dates.
Impact
Therapists using this session type would be under-billing for completed work. Revenue and compliance risk in a HIPAA-regulated product.
Notes
Manual verification against billing schedule confirms the formula is being applied incorrectly only at this boundary. Other session types calculate correctly.
High · BUG-1138 · Virtual Office · Recording

Recording indicator desync after UI merge: red light shows wrong state

Steps
  1. Join Power House meeting room.
  2. Click Record. Wait for confirmation.
  3. Click Stop Record. Wait for confirmation.
  4. Observe the recording indicator (red dot) in the top toolbar.
Expected
Red recording indicator should disappear immediately after Stop Record is acknowledged.
Actual
Red indicator remains visible for 4-7 seconds after recording has actually stopped. Users believe recording is still active when it is not. Inverse case also happens: indicator briefly disappears mid-recording, making users think recording stopped.
Impact
Trust in recording state is broken. Users either keep recording when they think they've stopped, or stop talking when they think recording has paused. Both create real customer-facing problems in sales demos.
Notes
Bug surfaced after a UI merge that re-wired the recording state event listener. Underlying recording works correctly; only the visual indicator is desynced.
Medium · BUG-0867 · EdTech · AI Quiz Generator

AI quiz generator produces questions with multiple valid answers

Steps
  1. Open the AI quiz generator.
  2. Paste a paragraph of source content covering a topic with ambiguous facts.
  3. Generate a multiple-choice quiz with 5 questions.
  4. Review each question against the source content.
Expected
Each generated question should have exactly one correct answer that is verifiable from the source content.
Actual
For certain question types, more than one of the answer options is technically correct based on the source, but only one is marked as the correct answer. Students selecting a different (also correct) answer would be marked wrong.
Impact
Quiz integrity issue. Teachers using these quizzes in classrooms would mark students wrong for technically correct answers. Discovered before deployment to schools.
Notes
Reproducible with multiple input sources. AI module is third-party, but our validation layer should be catching this before quizzes are saved.
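A rough sketch of what a pre-save check for the quiz bug above could look like. It is a heuristic under stated assumptions, not the product's validation layer. The `QuizQuestion` shape is hypothetical, and the verbatim source match only flags questions for human review; it cannot decide on its own whether a second answer is truly correct.

```typescript
// Hypothetical question shape for illustration.
interface QuizQuestion {
  prompt: string;
  options: string[];
  correctIndex: number;
}

// Lowercase, trim, and collapse whitespace so comparisons ignore formatting.
const normalizeText = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, " ");

// Returns the reasons a question should be reviewed by a human before it is saved.
function reviewFlags(q: QuizQuestion, source: string): string[] {
  const flags: string[] = [];
  const opts = q.options.map(normalizeText);

  if (new Set(opts).size !== opts.length) flags.push("duplicate options");
  if (q.correctIndex < 0 || q.correctIndex >= opts.length) {
    flags.push("correct answer index out of range");
  }

  // Heuristic: an unmarked option that appears verbatim in the source may also be correct.
  const src = normalizeText(source);
  const unmarkedInSource = opts.filter(
    (o, i) => i !== q.correctIndex && o.length > 0 && src.includes(o),
  );
  if (unmarkedInSource.length > 0) flags.push("unmarked option also appears in source");

  return flags;
}

const ambiguousQuestion: QuizQuestion = {
  prompt: "Which planet has no moons?",
  options: ["Mercury", "Venus", "Earth", "Mars"],
  correctIndex: 0,
};

console.log(reviewFlags(ambiguousQuestion, "Mercury and Venus have no moons."));
```

A check like this will produce false positives, since distractors often come from the source too, so it fits best as a review queue and not as an automatic reject.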

Sanitized examples showing how I structure test cases, with a focus on edge and boundary scenarios rather than only happy paths.

TC-BILL-018 · Boundary Value

Verify charge calculation at billing-code boundary transition

Pre-conditions
Therapist account with billing permissions. Test patient with active insurance.
Steps
  1. Complete sessions at the exact boundary value, one unit above it, and one unit below it.
  2. Generate claims for each.
Expected
Each charge matches the published billing schedule formula. Boundary transition handled correctly (no off-by-one in tier classification).
TC-AUTH-031 · Negative

Verify session handling after server-side session revocation

Pre-conditions
User logged into desktop app. Test admin has access to revoke sessions server-side.
Steps
  1. Revoke the user's session server-side while the desktop app is still running.
  2. Attempt to take an authenticated action (open a meeting, send a message).
Expected
App detects invalid session, surfaces a clear re-auth prompt, and does not silently fail or expose stale data.
TC-STRIPE-014 · Equivalence Partition

Subscription downgrade across plan tiers

Pre-conditions
Test account on a paid tier. Stripe test mode enabled.
Steps
  1. Downgrade from each higher tier to each lower tier (one case per valid pair).
  2. Verify webhook events fire correctly.
  3. Verify proration is calculated correctly.
Expected
Correct webhook event sequence (customer.subscription.updated). Proration credit matches Stripe's stated formula. Feature access reflects the new tier immediately.
TC-CALL-007 · Edge Case

Concurrent screen-share with audio at session capacity

Pre-conditions
Test fleet of multiple instances configured. Production-equivalent load profile.
Steps
  1. Scale up concurrent screen-share-with-sound publishers in increments.
  2. At each level, validate publish success, audio quality, and reconnect behavior under simulated network jitter.
Expected
System sustains publish count up to the documented capacity. Beyond capacity, failures are graceful (clear error to user, no orphaned sessions, no resource leaks).
TC-NFC-003 · Integration

NFC badge tap triggers correct downstream notification channels

Pre-conditions
Physical NFC badge paired with a registered user. Notification preferences set to email + SMS + voice.
Steps
  1. Tap badge on the touchscreen.
  2. Verify each downstream channel (email server, SMS gateway, voice API) receives the correct payload within the expected window.
Expected
All three channels deliver. Payload matches the badge holder's profile. Failure on one channel does not block delivery on the others.
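The "one case per valid pair" step in TC-STRIPE-014 expands mechanically, and a small generator keeps the case list from drifting when a tier is added. This is a sketch under assumptions: the tier names are placeholders, and the list is assumed to be ordered from lowest to highest.

```typescript
// Paid tiers ordered lowest to highest. The names are placeholders.
const paidTiers = ["basic", "pro", "enterprise"] as const;
type Tier = (typeof paidTiers)[number];

// Every [from, to] pair where `to` sits strictly below `from` in the ordering.
function downgradePairs(ordered: readonly Tier[]): Array<[Tier, Tier]> {
  const pairs: Array<[Tier, Tier]> = [];
  ordered.forEach((from, i) => {
    ordered.slice(0, i).forEach((to) => pairs.push([from, to]));
  });
  return pairs;
}

for (const [from, to] of downgradePairs(paidTiers)) {
  console.log(`downgrade ${from} -> ${to}`);
}
```

Each generated pair can then drive one parameterized test case, so the webhook and proration checks run once per transition.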

The vocabulary, briefly, for the ways I approach a test plan.

Boundary Value Analysis: Test the values on either side of every threshold. Off-by-one errors live here.
Equivalence Partitioning: Split inputs into groups that should behave the same. Test one representative per group, not every value.
Exploratory Testing: Unscripted, hypothesis-driven testing. Best for new features and for after a release when the spec doesn't cover what users will actually do.
Negative Testing: Explicit tests for invalid inputs, broken states, and unauthorized actions. The happy path is the smallest part of any feature.
Regression Testing: Automated coverage of the critical flows so manual time can focus on what changed and what's new.
Risk-Based Testing: Test depth follows business impact. Payments, security, and data integrity get more attention than copy or styling.
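Boundary value analysis is the easiest of these to show in code. The sketch below generates the three values around a threshold and runs them through a classifier. The 60-minute threshold and the `sessionTier` function are hypothetical and stand in for whatever rule is under test.

```typescript
// The three values worth testing around a threshold: one unit below, at, and one unit above.
function boundaryValues(threshold: number, unit = 1): number[] {
  return [threshold - unit, threshold, threshold + unit];
}

// Hypothetical rule under test: sessions of 60 minutes or more count as "extended".
function sessionTier(minutes: number): "standard" | "extended" {
  return minutes >= 60 ? "extended" : "standard";
}

const boundaryResults = boundaryValues(60).map((m) => [m, sessionTier(m)] as const);
console.log(boundaryResults);
```

If the rule were written with `>` where `>=` was intended, only the middle value would change, which is why the exact boundary gets its own case.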

A few things I've come to believe after four years of doing this.

The UI is a liar.

When a calculation looks right on screen, that's the moment to check the math by hand. The UI is the last place a bug gets revealed, not the first. Most of the real bugs are in the layer underneath, and the UI is doing its best to hide them.

Test what you can't easily test.

If something is expensive to test, that's usually the most important thing to test. Capacity. Real hardware. Production-equivalent load. The bugs that ship are almost always the ones nobody wanted to bother reproducing.

Automation is a tool, not a strategy.

I use Playwright every day, but automation isn't the whole job. The best setup I've worked with combines targeted automation that guards the critical flows with manual exploration that finds the edge cases. "100% automated" QA tends to be 100% green and still missing the real bugs.

If you can't reproduce it, it isn't fixed.

"Works on my machine" is the start of a bug investigation, not the end of one. Half the value of QA is the discipline of reproduction, the part where you actually find the conditions that cause the bug.

// playwright + cdp · multi-user electron automation
// specs/multi-user/screen-share.spec.ts
import { test, expect } from '@playwright/test';
import { attachToElectronApp, injectSession } from '../helpers/cdp';

test('two users in Power House, host shares screen with sound', async () => {
  // each user runs in its own browser profile
  // session cookies are injected per profile so the login survives reloads
  const host  = await attachToElectronApp({ profile: 'zara' });
  const guest = await attachToElectronApp({ profile: 'raiha' });

  await injectSession(host,  'host.session.json');
  await injectSession(guest, 'guest.session.json');

  await host.joinRoom('power-house');
  await guest.joinRoom('power-house');

  await host.startScreenShare({ withAudio: true });

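  // guest should see the publisher within 3s
  await expect(guest.publisherTile('zara')).toBeVisible({ timeout: 3000 });

  // poll until audio is flowing; assumes the audioLevel() helper returns a number
  // (or a promise of one), since toBeGreaterThan does not retry on its own
  await expect
    .poll(() => guest.audioLevel('zara'), { timeout: 3000 })
    .toBeGreaterThan(0);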
});

Most of my AI testing so far has been manual: comparing model output against source content, watching for the failure modes, flagging patterns. The next direction I'm pushing on is making that more systematic, treating LLM outputs less like single answers to grade and more like distributions to measure.

Reading and experimenting with:

  • Promptfoo and LangSmith for prompt regression and structured evals
  • The OpenAI evals framework and how teams build ground-truth datasets
  • Where deterministic QA breaks down for LLM outputs, and what replaces it
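A minimal sketch of what "distributions to measure" can mean in practice: sample the same input several times, score each output with a check, and compare pass rates between prompt versions. This does not use the Promptfoo, LangSmith, or OpenAI evals APIs. The outputs, the `citesSource` check, and the 0.05 tolerance are all hypothetical.

```typescript
// Fraction of sampled outputs that satisfy a check.
function passRate(outputs: string[], check: (output: string) => boolean): number {
  if (outputs.length === 0) throw new RangeError("need at least one output");
  return outputs.filter(check).length / outputs.length;
}

// A change counts as a regression if the pass rate drops more than `tolerance` below baseline.
function hasRegressed(baseline: number, candidate: number, tolerance = 0.05): boolean {
  return baseline - candidate > tolerance;
}

// Hypothetical check and sampled outputs for one input, before and after a prompt change.
const citesSource = (output: string): boolean => output.includes("[source]");
const baselineOutputs = ["a [source]", "b [source]", "c [source]", "d"];
const candidateOutputs = ["a [source]", "b", "c", "d"];

const baselineRate = passRate(baselineOutputs, citesSource);
const candidateRate = passRate(candidateOutputs, citesSource);
console.log({ baselineRate, candidateRate, regressed: hasRegressed(baselineRate, candidateRate) });
```

The tolerance is the part that replaces exact-match assertions: a single failing sample is noise, while a sustained drop in pass rate is a regression.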

I started in mobile development, decided I preferred breaking things to building them, and have been doing QA ever since.

I've worked across regulated healthcare, a connected EdTech ecosystem of six products spanning web, Android, and touchscreen hardware, and currently I'm solo QA on a new Electron-based virtual office product, owning test strategy, automation, and release sign-off for a small engineering team.

The work I find most interesting is the kind where the UI looks fine, the dev tested it in staging, and somewhere underneath the math doesn't add up.

Outside of work, I write fiction. Usually about people stuck in places they can't leave.

automation: Playwright (TypeScript), CDP for Electron, Cypress, Selenium with Appium
api & perf: Postman (collections, contract testing), JMeter (load, stress, soak)
ai_quality: LLM output validation, prompt regression, ranking-quality checks
payments: Stripe subscription lifecycle, webhook event validation
workflow: Jira, Git/GitHub, Agile/Scrum, CI-integrated test runs
languages: TypeScript, JavaScript, Python