Zara Shafiq · QA Engineer

About

I started in mobile development, decided I preferred breaking things to building them, and have been doing QA ever since.

I've worked in regulated healthcare and on a connected EdTech ecosystem of six products spanning web, Android, and touchscreen hardware. I'm currently solo QA on an Electron-based virtual office product, owning test strategy, automation, and release sign-off for a small engineering team.

The work I find most interesting is the kind where the UI looks fine, the dev tested it in staging, and somewhere underneath the math doesn't add up.

Outside of work, I write fiction. Usually about people stuck in places they can't leave.

4 years in QA

2500+ test cases written

500+ defects tracked

130+ automation specs

Selected Work

CASE 01·Healthcare·2022

The billing edge case that almost shipped

Junior QA on a US-based EHR product (the therapy module: appointment scheduling, clinical notes, patient records, billing). HIPAA-regulated. Release bar of 99% defect-free.

I was working through a billing flow during a release cycle. Therapist completes a session, the system calculates the charge, the claim gets generated. For a standard session the numbers looked correct. When I switched to a less common session type (one that hit a specific billing-code boundary), the calculated charge didn't match what I worked out on paper.

I wrote out the expected charges across each scenario, ran them through the system one by one, and confirmed the mismatch was consistent and reproducible. Filed the bug with the math, the scenarios, and screenshots. Looped in the dev lead the same day.

Fixed before release. If it had shipped, therapists using that session type would have been under-billing for the work they'd done. In a regulated healthcare product, that's both a revenue and a compliance issue.

Never trust the UI when the math is doing the actual work. If a calculation looks right on screen, that's exactly when I check it by hand.

ehr · billing · hipaa · manual

CASE 02·EdTech·2023–24

Testing six products on real hardware, not just dev environments

Manual QA at an EdTech company building a connected ecosystem: an AI quiz generator, an AI transcription and translation service, a video conferencing tool, a lesson capture platform, an NFC-badge notification system, and a lesson builder. All of it ran across web, Android, and Android-based interactive touchscreen displays.

The default at the company was to test on dev environments and emulators, because real hardware was slower and more annoying to set up. I pushed to test on the actual touchscreen units instead, because the bugs that mattered were the ones that showed up only on real hardware: touch input latency under heavy load, NFC tap reliability with real badges, audio routing on the embedded Android, cross-device sync between a touchscreen and a teacher's mobile.

That choice meant I had to context-switch fast (a hardware bug in the morning, an AI accuracy check at lunch, a Stripe webhook validation in the afternoon) and learn each product's edges quickly. Caught wrong-question bugs in the AI quiz generator, transcription errors in the captioning tool, and a Stripe billing edge case where a subscription state transition wasn't firing the right webhook. The Stripe flow was where I learned subscription testing the hard way: every plan transition, every webhook event, every failed payment branch needed its own case.

Emulators tell you what the product does in theory. Real hardware tells you what it does in the room with the people who'll actually use it.

edtech · hardware · android · ai-features · stripe

CASE 03·Real-time AI·2026

Stress-testing a meeting room we couldn't afford to stress-test

Solo QA on a new Electron-based virtual office product. One of the meeting rooms routes through a production dialer for sales calls and customer demos. Customers were starting to ask how many concurrent dialer sessions the room could support. Internally, no one had a real answer.

Testing with real dialers at scale would have been expensive: real phone numbers, real per-minute call costs, real load on the dialer provider. My local machine also had hardware limits, so I couldn't just spin up more instances locally.

I worked through the test design in pieces. First, I substituted screen-share-with-sound for the real dialer. Both flow through the same LiveKit publisher pipeline under the hood, so the load on the backend would be equivalent without the per-call cost. Next, I moved off my local machine onto GitHub Actions runners to get past the hardware limits (36 runners in parallel). Then I asked the dev team to add a flag that allowed multiple instances of the app to run on the same machine, so a single runner could carry more than one user. Finally, I ran the test under realistic 3-window load (workspace + dialer + meeting room running simultaneously per instance), so the numbers reflected actual user behavior, not a synthetic baseline.
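The fan-out across runners comes down to deciding how many app instances each runner has to carry. A minimal sketch of that arithmetic, assuming a hypothetical `planShards` helper; the user and runner counts are illustrative and are not the figures from the actual run.

```typescript
// Hypothetical helper: spread N simulated users across CI runners.
// Names and numbers are illustrative, not taken from the real setup.
interface Shard {
  runner: number;    // zero-based runner index
  userIds: number[]; // simulated users this runner must launch
}

function planShards(totalUsers: number, runners: number): Shard[] {
  if (totalUsers < 0 || runners < 1) {
    throw new Error("need totalUsers >= 0 and runners >= 1");
  }
  const shards: Shard[] = [];
  for (let r = 0; r < runners; r++) {
    shards.push({ runner: r, userIds: [] });
  }
  // Round-robin keeps per-runner load within one user of every other runner.
  for (let u = 0; u < totalUsers; u++) {
    shards[u % runners].userIds.push(u);
  }
  return shards;
}

// Largest number of app instances any single runner has to carry.
function maxInstancesPerRunner(shards: Shard[]): number {
  let max = 0;
  for (const s of shards) {
    if (s.userIds.length > max) max = s.userIds.length;
  }
  return max;
}

const plan = planShards(10, 4);
console.log(plan.map((s) => s.userIds.length)); // [3, 3, 2, 2]
```

The per-runner maximum is the number to check against what one runner can handle, since each instance also opens the workspace, dialer, and meeting room windows.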

Verified 33 concurrent screen-share-with-sound publishers under production-equivalent load. The product team had a real number to give to customers, and a real understanding of where capacity actually sat (higher than the team had assumed). The setup is reusable, so when engineering ships changes to the real-time layer we can re-run the test instead of guessing.

Power House · test architecture

A lot of capacity questions get answered with "we think it's around X." Going and measuring it takes more design work but produces a number you can actually defend.

electron · livekit · load · playwright · github-actions

Testing AI Products

A short note, because most of my recent work has involved AI features and the testing practices are different.

Output Validation

Comparing AI-generated content against source material by hand, looking for the failure modes (hallucination, drift, mismatched answers, multiple valid answers marked as one). Flagging patterns back to the data science team.

Prompt Regression

When a model or prompt changes, the same inputs need to produce the same kind of outputs. Tracking output quality across prompt iterations and model updates.

Ranking Quality

For AI matching and ranking features (e.g. candidate ranking, search ranking), checking whether the top results are actually relevant or just plausible-looking.

UI Around AI

The AI is usually a third-party integration, but the product around it isn't. Most bugs live in the user-facing layer: input handling, editing, exporting, organizing AI-generated content.

What I Watch For

Plausible-looking output is the most dangerous kind. The AI can be confidently wrong in a way that the UI does nothing to flag. The job is to assume every "looks fine" output is suspect until verified.
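For the ranking-quality checks, one common way to turn "are the top results actually relevant" into a number is precision@k. The sketch below assumes the relevance labels come from manual review; the function name and the sample IDs are illustrative.

```typescript
// Precision@k: fraction of the top-k results that a reviewer judged relevant.
// Relevance labels are assumed to come from manual review of the results.
function precisionAtK(rankedIds: string[], relevant: Set<string>, k: number): number {
  if (k < 1) throw new Error("k must be >= 1");
  const top = rankedIds.slice(0, k);
  let hits = 0;
  for (const id of top) {
    if (relevant.has(id)) hits++;
  }
  // Divide by k rather than top.length so a short result list is penalized.
  return hits / k;
}

const ranked = ["a", "b", "c", "d", "e"];
const judgedRelevant = new Set<string>(["a", "c", "z"]);
console.log(precisionAtK(ranked, judgedRelevant, 5)); // 0.4
```

A low score on results that all look plausible is exactly the case worth flagging: the ranking reads well but does not put relevant items first.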

Sample Bug Reports

Sanitized examples in the format I use day to day. The bugs are real, the company and product names are not.

Critical · BUG-2204 · EHR · Billing Module

Therapy session charge miscalculated on specific billing-code boundary

Steps

Log in as therapist with billing permissions.
Schedule and complete a session using session type [REDACTED-X] at the billing-code boundary [REDACTED-CODE].
Navigate to Billing → Generate Claim.
Observe the calculated charge in the claim preview.

Expected

Charge equals (base rate × session duration multiplier) + applicable modifiers, matching the published billing schedule.

Actual

Charge is under-calculated by the modifier amount on this specific session type. Reproducible across multiple test patients and dates.

Impact

Therapists using this session type would be under-billing for completed work. Revenue and compliance risk in a HIPAA-regulated product.

Notes

Manual verification against billing schedule confirms the formula is being applied incorrectly only at this boundary. Other session types calculate correctly.
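The hand check in this report can be sketched as a small independent oracle built from the formula in the Expected section. Everything concrete here is an assumption for illustration: the rates, the multiplier, the modifier amount, and the rounding rule are made up, and amounts are kept in integer cents to avoid floating-point drift.

```typescript
// Independent oracle for: charge = (base rate × duration multiplier) + modifiers.
// All amounts are integer cents. The figures used with it are invented.
interface Scenario {
  name: string;
  baseRateCents: number;
  durationMultiplier: number;
  modifierCents: number[];
}

function expectedChargeCents(s: Scenario): number {
  let modifiers = 0;
  for (const m of s.modifierCents) modifiers += m;
  // Rounding to the nearest cent is an assumption; a real schedule may differ.
  return Math.round(s.baseRateCents * s.durationMultiplier) + modifiers;
}

// Compare hand-computed expectations with the charges the system reported.
function findMismatches(
  scenarios: Scenario[],
  actualCents: { [name: string]: number }
): string[] {
  const mismatches: string[] = [];
  for (const s of scenarios) {
    const expected = expectedChargeCents(s);
    const actual = actualCents[s.name];
    if (actual !== expected) {
      mismatches.push(s.name + ": expected " + expected + ", got " + actual);
    }
  }
  return mismatches;
}
```

Feeding it a standard session and a boundary session where the system dropped the modifier returns a single mismatch line for the boundary case, which is the shape of evidence attached to a report like this one.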

High · BUG-1138 · Virtual Office · Recording

Recording indicator desync after UI merge: red light shows wrong state

Steps

Join Power House meeting room.
Click Record. Wait for confirmation.
Click Stop Record. Wait for confirmation.
Observe the recording indicator (red dot) in the top toolbar.

Expected

Red recording indicator should disappear immediately after Stop Record is acknowledged.

Actual

Red indicator remains visible for 4-7 seconds after recording has actually stopped. Users believe recording is still active when it is not. Inverse case also happens: indicator briefly disappears mid-recording, making users think recording stopped.

Impact

Trust in recording state is broken. Users either keep recording when they think they've stopped, or stop talking when they think recording has paused. Both create real customer-facing problems in sales demos.

Notes

Bug surfaced after a UI merge that re-wired the recording state event listener. Underlying recording works correctly; only the visual indicator is desynced.

Medium · BUG-0867 · EdTech · AI Quiz Generator

AI quiz generator produces questions with multiple valid answers

Steps

Open the AI quiz generator.
Paste a paragraph of source content covering a topic with ambiguous facts.
Generate a multiple-choice quiz with 5 questions.
Review each question against the source content.

Expected

Each generated question should have exactly one correct answer that is verifiable from the source content.

Actual

For certain question types, more than one of the answer options is technically correct based on the source, but only one is marked as the correct answer. Students selecting a different (also correct) answer would be marked wrong.

Impact

Quiz integrity issue. Teachers using these quizzes in classrooms would mark students wrong for technically correct answers. Discovered before deployment to schools.

Notes

Reproducible with multiple input sources. AI module is third-party, but our validation layer should be catching this before quizzes are saved.
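A sketch of the kind of check a validation layer could run before a quiz is saved. This is not the product's actual validator: the types and the `validateQuestion` name are hypothetical, and the "is this option supported by the source" decision is left as a function the caller supplies (a reviewer's label, or a stricter automated check).

```typescript
interface QuizQuestion {
  prompt: string;
  options: string[];
  markedCorrect: number; // index into options
}

// Pluggable decision: does the source support this option as a correct answer?
type SupportCheck = (option: string, source: string) => boolean;

type Verdict = "ok" | "ambiguous" | "no-valid-answer" | "wrong-answer-marked";

function validateQuestion(
  q: QuizQuestion,
  source: string,
  isSupported: SupportCheck
): Verdict {
  const supported: number[] = [];
  for (let i = 0; i < q.options.length; i++) {
    if (isSupported(q.options[i], source)) supported.push(i);
  }
  if (supported.length === 0) return "no-valid-answer";
  // More than one supported option is the failure described in this report.
  if (supported.length > 1) return "ambiguous";
  return supported[0] === q.markedCorrect ? "ok" : "wrong-answer-marked";
}
```

With a naive substring check as the `SupportCheck`, a question such as "Which country does the Nile flow through?" against a source naming both Egypt and Sudan comes back as "ambiguous" instead of being saved with one answer marked.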

Sample Test Cases

Sanitized examples showing how I structure test cases, with a focus on edge and boundary scenarios rather than only happy paths.

TC-BILL-018 · Boundary Value

Verify charge calculation at billing-code boundary transition

Pre-conditions

Therapist account with billing permissions. Test patient with active insurance.

Steps

Complete sessions at the exact boundary value, one unit above it, and one unit below it. Generate claims for each.

Expected

Each charge matches the published billing schedule formula. Boundary transition handled correctly (no off-by-one in tier classification).
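The three inputs in the steps above can be generated rather than typed by hand. A minimal sketch; the 60-minute boundary and the tier rule are hypothetical and only there to show what an off-by-one would look like.

```typescript
// Boundary value analysis: the boundary itself plus one unit on either side.
function boundaryTriple(boundary: number, unit: number = 1): number[] {
  return [boundary - unit, boundary, boundary + unit];
}

// Hypothetical tier rule, for illustration only.
// Assumption: values up to and including the boundary fall in the lower tier.
function tierFor(value: number, boundary: number): "lower" | "upper" {
  return value <= boundary ? "lower" : "upper";
}

const cases = boundaryTriple(60).map((minutes) => ({
  minutes: minutes,
  tier: tierFor(minutes, 60),
}));
console.log(cases);
// 59 -> lower, 60 -> lower, 61 -> upper
```

If an implementation used `<` where the schedule means `<=`, only the middle case would change tier, which is why the exact boundary value needs its own claim.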

TC-AUTH-031 · Negative

Verify handling of a session revoked server-side while the app is running

Pre-conditions

User logged into desktop app. Test admin has access to revoke sessions server-side.

Steps

Revoke the user's session server-side while the desktop app is still running. Attempt to take an authenticated action (open a meeting, send a message).

Expected

App detects invalid session, surfaces a clear re-auth prompt, and does not silently fail or expose stale data.

TC-STRIPE-014 · Equivalence Partition

Subscription downgrade across plan tiers

Pre-conditions

Test account on a paid tier. Stripe test mode enabled.

Steps

Downgrade from each higher tier to each lower tier (one case per valid pair). Verify webhook events fire correctly. Verify proration is calculated correctly.

Expected

Correct webhook event sequence (customer.subscription.updated). Proration credit matches Stripe's stated formula. Feature access reflects the new tier immediately.
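"One case per valid pair" can be enumerated from an ordered tier list so no transition is missed when a plan is added. A minimal sketch; the tier names are placeholders, not real plan names.

```typescript
// Tiers are passed lowest to highest. Returns every [from, to] downgrade pair.
function downgradePairs(tiersLowToHigh: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let from = tiersLowToHigh.length - 1; from > 0; from--) {
    for (let to = from - 1; to >= 0; to--) {
      pairs.push([tiersLowToHigh[from], tiersLowToHigh[to]]);
    }
  }
  return pairs;
}

const pairs = downgradePairs(["basic", "pro", "enterprise"]);
console.log(pairs.map((p) => p[0] + " -> " + p[1]));
// enterprise -> pro, enterprise -> basic, pro -> basic
```

For n tiers this yields n(n-1)/2 cases, so the list grows quickly; that count is worth knowing before committing to a webhook and proration check on every pair.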

TC-CALL-007 · Edge Case

Concurrent screen-share with audio at session capacity

Pre-conditions

Test fleet of multiple instances configured. Production-equivalent load profile.

Steps

Scale up concurrent screen-share-with-sound publishers in increments. At each level, validate publish success, audio quality, and reconnect behavior under simulated network jitter.

Expected

System sustains publish count up to the documented capacity. Beyond capacity, failures are graceful (clear error to user, no orphaned sessions, no resource leaks).
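The incremental scale-up can be sketched as a ramp loop that stops at the first failing level. The probe that actually drives the publishers is injected and out of scope here; it is synchronous only to keep the sketch short, and the levels used are illustrative.

```typescript
// A probe runs the scenario at a given publisher count and reports pass/fail.
// A real probe would be async and would check publish success, audio, and reconnects.
type Probe = (publishers: number) => boolean;

interface RampResult {
  highestPassing: number;      // 0 if even the first level failed
  firstFailing: number | null; // null if every level passed
}

function rampUntilFailure(
  start: number,
  step: number,
  max: number,
  probe: Probe
): RampResult {
  if (start < 1 || step < 1 || max < start) throw new Error("invalid ramp");
  let highestPassing = 0;
  for (let level = start; level <= max; level += step) {
    if (!probe(level)) {
      return { highestPassing: highestPassing, firstFailing: level };
    }
    highestPassing = level;
  }
  return { highestPassing: highestPassing, firstFailing: null };
}
```

The true capacity sits somewhere between `highestPassing` and `firstFailing`, so a second ramp with a smaller step over that range narrows it down.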

TC-NFC-003 · Integration

NFC badge tap triggers correct downstream notification channels

Pre-conditions

Physical NFC badge paired with a registered user. Notification preferences set to email + SMS + voice.

Steps

Tap badge on the touchscreen. Verify each downstream channel (email server, SMS gateway, voice API) receives the correct payload within the expected window.

Expected

All three channels deliver. Payload matches the badge holder's profile. Failure on one channel does not block delivery on the others.
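The last expectation, that one failing channel does not block the others, is the behavior a test double can model. The sketch below is an assumption about shape, not the product's code: `Channel`, `deliverToAll`, and the fake channels are hypothetical, and each send is wrapped so its failure is recorded instead of propagated.

```typescript
interface Channel {
  name: string;
  send: (payload: string) => Promise<void>;
}

interface DeliveryResult {
  channel: string;
  ok: boolean;
  error?: string;
}

// Send to every channel in parallel and capture each outcome separately,
// so one rejected (or throwing) send cannot short-circuit the rest.
function deliverToAll(channels: Channel[], payload: string): Promise<DeliveryResult[]> {
  return Promise.all(
    channels.map((c) =>
      Promise.resolve()
        .then(() => c.send(payload))
        .then(
          (): DeliveryResult => ({ channel: c.name, ok: true }),
          (err: unknown): DeliveryResult => ({
            channel: c.name,
            ok: false,
            error: err instanceof Error ? err.message : String(err),
          })
        )
    )
  );
}
```

In a test, swapping the SMS channel for one that always rejects and asserting that email and voice still report `ok: true` covers the isolation requirement directly.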

Testing Methodology

The vocabulary, briefly, for the ways I approach a test plan.

Boundary Value Analysis

Test the values on either side of every threshold. Off-by-one errors live here.

Equivalence Partitioning

Split inputs into groups that should behave the same. Test one representative per group, not every value.

Exploratory Testing

Unscripted, hypothesis-driven testing. Best for new features, and after a release, when the spec doesn't cover what users will actually do.

Negative Testing

Explicit tests for invalid inputs, broken states, and unauthorized actions. The happy path is the smallest part of any feature.

Regression Testing

Automated coverage of the critical flows so manual time can focus on what changed and what's new.

Risk-Based Testing

Test depth follows business impact. Payments, security, and data integrity get more attention than copy or styling.

How I Think About Testing

A few things I've come to believe after four years of doing this.

The UI is a liar.

When a calculation looks right on screen, that's the moment to check the math by hand. The UI is the last place a bug gets revealed, not the first. Most of the real bugs are in the layer underneath, and the UI is doing its best to hide them.

Test what you can't easily test.

If something is expensive to test, that's usually the most important thing to test. Capacity. Real hardware. Production-equivalent load. The bugs that ship are almost always the ones nobody wanted to bother reproducing.

Automation is a tool, not a strategy.

I use Playwright every day, but automation isn't the whole job. The best setup I've worked with combines targeted automation that guards the critical flows with manual exploration that finds the edge cases. "100% automated" QA tends to be 100% green while still missing the real bugs.

If you can't reproduce it, it isn't fixed.

"Works on my machine" is the start of a bug investigation, not the end of one. Half the value of QA is the discipline of reproduction, the part where you actually find the conditions that cause the bug.

// playwright + cdp · multi-user electron automation · specs/multi-user/screen-share.spec.ts

import { test, expect } from '@playwright/test';
import { attachToElectronApp, injectSession } from '../helpers/cdp';

test('two users in Power House, host shares screen with sound', async () => {
  // each user runs in its own browser profile
  // session cookies injected per profile so reloads survive
  const host  = await attachToElectronApp({ profile: 'zara' });
  const guest = await attachToElectronApp({ profile: 'raiha' });

  await injectSession(host,  'host.session.json');
  await injectSession(guest, 'guest.session.json');

  await host.joinRoom('power-house');
  await guest.joinRoom('power-house');

  await host.startScreenShare({ withAudio: true });

  // guest should see the publisher within 3s
  await expect(guest.publisherTile('zara')).toBeVisible({ timeout: 3000 });
  // audio level rises asynchronously, so poll it instead of reading it once
  await expect.poll(() => guest.audioLevel('zara')).toBeGreaterThan(0);
});

—

Currently Learning

Most of my AI testing so far has been manual: comparing model output against source content, watching for the failure modes, flagging patterns. The next direction I'm pushing on is making that more systematic, treating LLM outputs less like single answers to grade and more like distributions to measure.

Reading and experimenting with:

Promptfoo and LangSmith for prompt regression and structured evals
The OpenAI evals framework and how teams build ground-truth datasets
Where deterministic QA breaks down for LLM outputs, and what replaces it
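One way to picture "distributions to measure": run the same prompt several times, apply a pass/fail check to each output, and compare pass rates between prompt versions. This is a plain TypeScript sketch and does not use the Promptfoo, LangSmith, or OpenAI evals APIs; the check function and the tolerance value are assumptions.

```typescript
// Treat repeated generations for one prompt as a sample and measure a pass rate,
// instead of grading a single output.
type Check = (output: string) => boolean;

interface PassRate {
  passed: number;
  total: number;
  rate: number; // passed / total, or 0 when there are no samples
}

function passRate(outputs: string[], check: Check): PassRate {
  let passed = 0;
  for (const o of outputs) {
    if (check(o)) passed++;
  }
  const total = outputs.length;
  return { passed: passed, total: total, rate: total === 0 ? 0 : passed / total };
}

// Regression gate: the candidate must not fall more than `tolerance` below
// the baseline. The tolerance is a team decision, not a standard value.
function regressed(baseline: PassRate, candidate: PassRate, tolerance: number): boolean {
  return candidate.rate < baseline.rate - tolerance;
}
```

The check can be as simple as a format rule (for example, "the answer starts with an option letter") or as involved as a reviewer-labelled comparison; the point is that the unit being tracked is the rate, not any single output.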

Stack

automation: Playwright (TypeScript), CDP for Electron, Cypress, Selenium with Appium

api & perf: Postman (collections, contract testing), JMeter (load, stress, soak)

ai_quality: LLM output validation, prompt regression, ranking-quality checks

payments: Stripe subscription lifecycle, webhook event validation

workflow: Jira, Git/GitHub, Agile/Scrum, CI-integrated test runs

languages: TypeScript, JavaScript, Python