Preparing for a Cassandra DBA or backend engineer interview can be daunting — the architecture is fundamentally different from relational databases, and interviewers expect you to demonstrate real operational instincts, not just textbook knowledge. This handbook compiles 200 questions across 10 chapters, covering everything you’d encounter in a senior-level interview for roles involving Apache Cassandra.

Overview · Cassandra Ring Architecture

Before diving into questions, here’s a visual overview of Cassandra’s distributed ring architecture — the foundation for most interview topics.

[Figure: Cassandra cluster ring topology (3 nodes, RF=3). Nodes 1–3 own token ranges 0–120, 121–240, and 241–360 (all UN) and exchange state over the gossip protocol. A client driver request at CL=QUORUM goes through a coordinator; each partition is replicated to 3 nodes, so quorum is 2 acks. Write path: 1. CommitLog, 2. Memtable, 3. Flush to SSTable, 4. Compaction, 5. Bloom filter. Each node holds 256 virtual token ranges (vnodes) by default.]
[Figure: Multi-datacenter architecture with NetworkTopologyStrategy. DC1 (primary region, nodes 1–3, all UN) and DC2 (DR region, nodes 4–6, all UN), each with RF=3 and LOCAL_QUORUM, connected by an arrow labeled "Async Replication (Hinted Handoff)".]

Chapter 1 · Q1–Q20 · Data Modeling

Data modeling is the most frequently tested Cassandra topic. Unlike relational databases, Cassandra requires you to model for your queries — not for normalization. Master partition key design, clustering order, and denormalization strategies.

Q1 How would you design partitions for high write throughput in time-series data?

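Avoid partitioning by time alone: if the partition key is just the current time bucket (for example, the day), every write in that window lands on one partition and its replicas. Use a composite partition key that combines a high-cardinality ID with a time bucket, so load spreads across the cluster and partitions stay bounded.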

CREATE TABLE sensor_data (
  device_id TEXT,
  day       DATE,
  ts        TIMESTAMP,
  reading   DOUBLE,
  PRIMARY KEY ((device_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Best Practice

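Keep partitions small: a common guideline is to stay under roughly 100MB and to treat partitions in the hundreds of megabytes as a redesign signal. Use nodetool tablestats (compacted partition mean and maximum bytes) to monitor partition sizes. Descending clustering order serves recent-data queries efficiently.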
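As a rough illustration of the bucketing idea, the sketch below derives the (device_id, day) partition key for a reading on the application side. The function name partition_key and the choice to normalize to UTC are assumptions made for this example, not something the schema itself requires.

```python
from datetime import datetime, timedelta, timezone


def partition_key(device_id: str, ts: datetime) -> tuple[str, str]:
    """Return the (device_id, day) pair used as the composite partition key."""
    # Normalize to UTC so a reading lands in the same bucket regardless of
    # the writer's local time zone (an assumption of this sketch).
    day = ts.astimezone(timezone.utc).date().isoformat()
    return (device_id, day)


# Two readings a couple of seconds apart fall into different day buckets.
late = datetime(2025, 5, 1, 23, 59, 59, tzinfo=timezone.utc)
early = datetime(2025, 5, 2, 0, 0, 1, tzinfo=timezone.utc)

# A reading stamped in UTC-5 is bucketed by its UTC day, not its local day.
local = datetime(2025, 5, 1, 22, 30, tzinfo=timezone(timedelta(hours=-5)))

print(partition_key("sensor-42", late))
print(partition_key("sensor-42", early))
print(partition_key("sensor-42", local))
```

Each device gets one partition per day, so a single busy device can still produce a large partition; a finer bucket (hour) or an extra shard component are the usual next steps if that happens.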

Q5 What are the trade-offs of using collections (list, set, map)?

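Collections are meant for small, bounded data. A non-frozen collection is always read in full (you cannot page through its elements), a frozen collection must be rewritten in full on every update, and overwriting a whole non-frozen collection writes a tombstone first. Some list operations (setting or deleting by index) also trigger a read-before-write. Use collections only for small metadata.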

CREATE TABLE user_profile (
  user_id     TEXT PRIMARY KEY,
  emails      SET<TEXT>,
  preferences MAP<TEXT, TEXT>
);
Warning

Model unbounded collections as separate child tables. Never store collections with thousands of items in a single column.

Q8 How do you choose partition keys to avoid hotspots?

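With the default Murmur3Partitioner, partition keys are hashed, so hotspots come from low-cardinality or skewed keys (a date alone, a status flag, one very busy tenant) rather than from key values being sequential. Combine a high-cardinality ID such as a UUID with a time bucket for even distribution and bounded partition size.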

CREATE TABLE metrics (
  metric_id  UUID,
  bucket_day DATE,
  ts         TIMESTAMP,
  value      DOUBLE,
  PRIMARY KEY ((metric_id, bucket_day), ts)
);
Best Practice

Monitor token distribution with nodetool status. Ensure your driver is configured for token-aware load balancing.

Q14 How do you enforce uniqueness in Cassandra?

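Cassandra has no native UNIQUE constraint. Use a Lightweight Transaction (LWT) with IF NOT EXISTS for a linearizable insert. Note that IF NOT EXISTS checks only the primary key: the example below guarantees a unique id, so to enforce unique emails you would key a table by email.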

INSERT INTO users (id, email)
VALUES ('u1', 'x@test.com')
IF NOT EXISTS;
Performance Impact

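LWTs use Paxos consensus, which takes four round trips between the coordinator and the replicas in the classic implementation, so expect roughly 4× the latency of a regular write. Use sparingly: only for critical uniqueness constraints like user registration or idempotency keys.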

Q16 How do you model many-to-many relationships?

Use two denormalized tables — one per query direction. Accept the duplicate storage cost as the price of scalable reads.

CREATE TABLE student_courses (
  student_id TEXT,
  course_id  TEXT,
  PRIMARY KEY ((student_id), course_id)
);

CREATE TABLE course_students (
  course_id  TEXT,
  student_id TEXT,
  PRIMARY KEY ((course_id), student_id)
);
Best Practice

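Keep dual writes idempotent. Consider wrapping both inserts in a logged BATCH for atomicity (not performance): the batch guarantees that both writes are eventually applied, but it provides no isolation across partitions.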

Q19 What is the impact of TTL in schema design?

TTL creates tombstones when data expires. High tombstone counts slow reads and stress compaction.

INSERT INTO sessions (id, data)
VALUES ('s1', 'abc')
USING TTL 3600;
Gotcha

For TTL-heavy workloads (session caches, IoT data), use TimeWindowCompactionStrategy (TWCS), which groups SSTables by time window so that fully expired SSTables can be dropped whole instead of being compacted tombstone by tombstone.

Chapter 2 · Q21–Q40 · Replication & Consistency

Consistency levels are one of the most tested areas. You must understand the trade-off triangle between consistency, availability, and latency — and know exactly when to use each level.

Consistency Levels: Latency vs Durability Trade-off
[Figure: consistency levels ordered by increasing consistency strength and latency. ONE: fastest, low latency, stale reads possible. LOCAL_QUORUM (recommended): RF/2+1 replicas in the local DC. QUORUM: majority across DCs, higher latency. ALL: all replicas, highest consistency, avoid in production.]

Q23 What are the trade-offs between consistency levels ONE, QUORUM, and ALL?

ONE: acknowledges after 1 replica responds. Fastest, but risks stale reads if another replica has newer data.

QUORUM: requires a majority of replicas, floor(RF/2) + 1, to acknowledge, where RF is summed across all DCs. Balances consistency and availability, at the cost of cross-DC round trips in multi-DC clusters.

ALL: all replicas must respond. Strongest consistency, highest latency, and fragile: one downed replica fails the request.

Production Rule

Use LOCAL_QUORUM for multi-DC deployments. It restricts acknowledgment to the local DC, avoiding cross-DC latency while maintaining quorum consistency.
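The quorum arithmetic is easy to get wrong under interview pressure, so here is a minimal sketch of it. The helper names quorum and reads_see_latest_write, and the dc1/dc2 replication map, are illustrative assumptions; the overlap rule (read replicas + write replicas > RF) is the standard condition for a read to intersect the latest acknowledged write.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


rf_per_dc = {"dc1": 3, "dc2": 3}  # hypothetical two-DC keyspace

local_quorum = quorum(rf_per_dc["dc1"])            # counted in the local DC only
global_quorum = quorum(sum(rf_per_dc.values()))    # counted across all DCs

print("LOCAL_QUORUM acks:", local_quorum)
print("QUORUM acks:", global_quorum)
print("QUORUM read + QUORUM write at RF=3 overlap:",
      reads_see_latest_write(quorum(3), quorum(3), 3))
print("ONE read + ONE write at RF=3 overlap:",
      reads_see_latest_write(1, 1, 3))
```

With RF=3 per DC, LOCAL_QUORUM needs 2 local replicas while QUORUM needs 4 of the 6 replicas overall, which is where the cross-DC latency comes from.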

Q25 What is hinted handoff and when should you disable it?

When a target replica is temporarily down, the coordinator stores a “hint” and replays the write once the node recovers. This preserves write availability during short outages.

# cassandra.yaml
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000  # 3 hours
Disable when

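Hints are only stored for the duration of max_hint_window_in_ms. If a node is down for more than 3 hours, coordinators stop writing hints for it, so replay alone cannot bring it back in sync: run a repair after it rejoins. Consider disabling hinted handoff (or throttling it with hinted_handoff_throttle_in_kb) only when hint replay storms overwhelm recovering nodes.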
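The relationship between outage length and the hint window can be sketched as a simple decision. The function name recovery_action and its return strings are hypothetical; the 3-hour constant mirrors the max_hint_window_in_ms value shown in the cassandra.yaml snippet.

```python
MAX_HINT_WINDOW_MS = 10_800_000  # 3 hours, matching the cassandra.yaml value


def recovery_action(downtime_ms: int,
                    max_hint_window_ms: int = MAX_HINT_WINDOW_MS) -> str:
    """Pick how a returning replica is expected to catch up."""
    if downtime_ms <= max_hint_window_ms:
        # Coordinators were still storing hints for the whole outage.
        return "hint replay"
    # Hints stopped being written part-way through the outage.
    return "repair"


print(recovery_action(20 * 60 * 1000))       # 20-minute outage
print(recovery_action(5 * 60 * 60 * 1000))   # 5-hour outage
```

Hints are best-effort even inside the window, so this does not replace regularly scheduled repairs.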

Q30 How does consistency work with Lightweight Transactions (LWT)?

LWTs use the Paxos protocol to provide linearizable consistency — a stronger guarantee than Cassandra’s eventual consistency model.

INSERT INTO users (id, email)
VALUES ('u1', 'x@test.com')
IF NOT EXISTS;

-- Result:
-- [applied] = true  (success)
-- [applied] = false (row already exists)
Performance

LWTs incur ~4× latency overhead vs regular writes. Reserve for registration, idempotency keys, and financial transactions where exactly-once semantics are critical.

Q38 How does Cassandra handle conflicting writes?

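Cassandra uses last-write-wins (LWW), resolved per cell using write timestamps. Modern drivers generate these timestamps client-side by default; otherwise the coordinator assigns them. The write with the highest timestamp wins, regardless of which replica received it first.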

Clock Skew Risk

If client clocks are out of sync (NTP drift), an older write from a node with a fast clock can override a newer write. Keep all nodes and clients synchronized with NTP/Chrony — aim for <1ms drift.
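To make the clock-skew risk concrete, the sketch below simulates last-write-wins with one client whose clock runs 50 ms fast. The Write type, the resolve function, and the specific timestamps are assumptions for illustration; Cassandra's own tie-breaking for equal timestamps is not modeled here.

```python
from typing import NamedTuple


class Write(NamedTuple):
    timestamp_us: int  # write timestamp in microseconds
    value: str


def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the higher timestamp.

    Equal timestamps simply keep `a` in this sketch.
    """
    return a if a.timestamp_us >= b.timestamp_us else b


SKEW_US = 50_000  # client A's clock is 50 ms ahead of real time

# Real-time order: A writes "old" at t = 1.000 s, B writes "new" at t = 1.020 s.
write_a = Write(timestamp_us=1_000_000 + SKEW_US, value="old")
write_b = Write(timestamp_us=1_020_000, value="new")

winner = resolve(write_a, write_b)
print("surviving value:", winner.value)
```

The earlier write survives because its skewed timestamp is larger, and nothing in the cluster reports an error, which is why drift has to be monitored rather than detected after the fact.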

Chapter 3 · Q41–Q60 · Compaction & Repairs

Compaction and repair are the operational heart of Cassandra. Many production incidents trace back to misconfigured compaction strategies or neglected repairs. Know the three strategies cold.

Compaction Strategies: Use Case Matrix
STCS (SizeTieredCompactionStrategy): groups similar-size SSTables. ✓ Write-heavy workloads ✓ Default strategy ✗ High read amplification
LCS (LeveledCompactionStrategy): leveled tiers of non-overlapping SSTables. ✓ Read-heavy workloads ✓ Low read amplification ✗ High write amplification
TWCS (TimeWindowCompactionStrategy): fixed time-window buckets. ✓ TTL / time-series data ✓ IoT / logs / sessions ✗ No random deletes
Q44 How does TimeWindowCompactionStrategy (TWCS) work?

TWCS organizes SSTables into fixed time windows. SSTables within a window are merged together, and old windows are rarely touched once compacted. Once everything in an SSTable has expired, the whole file can be dropped without being rewritten.

ALTER TABLE iot_readings
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  };
Use When

Ideal for IoT sensor data, session caches, application logs, and any time-series data where you use TTL. Avoids the tombstone accumulation problems that STCS/LCS suffer with TTL.

Q45 How do tombstones affect read performance?

Tombstones are deletion markers. During a read, Cassandra must scan through all tombstones to find live data. High tombstone counts cause read timeouts and trigger TombstoneOverwhelmingException.

Warning Signs

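Watch Cassandra logs for: Read X live rows and Y tombstone cells.... By default, Cassandra logs a warning when a single query scans more than 1,000 tombstones (tombstone_warn_threshold) and aborts the read at 100,000 (tombstone_failure_threshold).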

# Check tombstone stats
$ nodetool tablestats keyspace.table
Best Practices

Minimize frequent deletes, avoid large TTLs on wide tables, use TWCS for TTL-heavy workloads, and keep gc_grace_seconds tuned to your repair schedule.

Q50 How does gc_grace_seconds impact repairs and tombstones?

gc_grace_seconds defines how long Cassandra waits before purging tombstones during compaction. Default is 10 days.

Zombie Row Risk

If a node is down when a delete happens, it never sees the tombstone. When it comes back, it may “resurrect” deleted rows — called zombie rows. The gc_grace_seconds window gives you time to run a repair before tombstones are garbage collected.

Rule

Always ensure gc_grace_seconds ≥ your repair interval. If you repair weekly, keep gc_grace at 10 days minimum.
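The rule can be checked with simple arithmetic. In this sketch the function name repair_fits_gc_grace is hypothetical, and counting the repair's own running time against the window is an assumption added for safety: every replica must have been repaired before tombstones become purgeable.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DEFAULT_GC_GRACE_SECONDS = 10 * SECONDS_PER_DAY  # the 10-day default


def repair_fits_gc_grace(repair_interval_days: float,
                         repair_duration_days: float,
                         gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True if a full repair cycle completes inside the gc_grace window."""
    cycle_seconds = (repair_interval_days + repair_duration_days) * SECONDS_PER_DAY
    return cycle_seconds <= gc_grace_seconds


print(DEFAULT_GC_GRACE_SECONDS)          # value to compare with the table setting
print(repair_fits_gc_grace(7, 1))        # weekly repair that takes a day
print(repair_fits_gc_grace(14, 1))       # fortnightly repair
```

A weekly schedule fits the 10-day default with room to spare, while a fortnightly one would require raising gc_grace_seconds or repairing more often.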

Chapter 4 · Q61–Q80 · Cluster Management

Operators are expected to bootstrap nodes safely, decommission without data loss, and manage topology changes with minimal impact. These questions probe your hands-on operational knowledge.

Q61 What happens during Cassandra node bootstrap?

Bootstrap is the process of a new node joining the cluster and receiving its share of data:

1. Node contacts seed nodes and joins gossip ring.
2. Token assignment (via vnodes or manual).
3. Streaming begins from existing replicas.
4. Node transitions to UN (Up/Normal).

$ nodetool status
# UN = Up Normal (healthy)
# UJ = Up Joining (bootstrapping, streaming data in)
# UL = Up Leaving (decommissioning, streaming data out)
# DN = Down Normal (unreachable)
Monitor with

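Use nodetool netstats to watch streaming progress during bootstrap. Avoid interrupting bootstrap: a node that fails midway holds partially streamed data and must either resume (nodetool bootstrap resume) or be wiped before retrying.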

Q63 How do you safely decommission a node?

Decommission streams all data that the node owns to its remaining replicas, then removes it from the ring cleanly.

# Step 1: Run repair first
$ nodetool repair

# Step 2: Decommission
$ nodetool decommission

# Monitor progress
$ nodetool netstats
Never do

Don’t decommission multiple nodes from the same DC simultaneously — you risk dropping below replication factor and losing data. Always wait for the first node to fully decommission before starting the next.

Q70 How do vnodes simplify cluster management?

Virtual nodes (vnodes) assign multiple token ranges to each physical node, rather than one contiguous range. The default was 256 tokens per node before Cassandra 4.0 and is 16 from 4.0 onward.

# cassandra.yaml
# Default: 256 before Cassandra 4.0, 16 since 4.0.
# Set this before the node first joins the cluster.
num_tokens: 16
Benefits

With vnodes: adding/removing a node automatically redistributes data evenly across the ring, with no manual token calculation. Reduces strain during bootstrap. For very large clusters (100+ nodes), reduce to 16–32 tokens to improve repair speed.

Chapter 5 · Q81–Q100 · Performance Tuning

Performance tuning interviews go deep: JVM heap, GC strategy, compaction throughput, thread pools, and read/write path optimization. Know your nodetool commands and what each metric means.

Q81 How do you tune JVM heap size in Cassandra?

Heap size directly impacts GC pause duration. A common starting point is 8GB: large enough for memtables and caches, but small enough to keep GC pauses well below 500ms.

# jvm.options
-Xms8G
-Xmx8G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=300
Rules

Heap ≤ 50% of RAM. Never exceed 16GB — larger heaps cause longer GC pauses. Use G1GC (default in newer JDKs). Enable GC logging and monitor pause times.
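For comparison with these rules, the sketch below reproduces the automatic heap calculation described in the comments of cassandra-env.sh, max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8GB)), which applies when no explicit heap size is set. The function name default_max_heap_mb is hypothetical, and the formula should be checked against the cassandra-env.sh shipped with your version.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Heap size in MB per the cassandra-env.sh default calculation."""
    half = system_memory_mb // 2
    quarter = system_memory_mb // 4
    # Small machines get up to half of RAM (capped at 1 GB);
    # larger machines get a quarter of RAM (capped at 8 GB).
    return max(min(half, 1024), min(quarter, 8192))


for ram_gb in (2, 16, 64):
    heap = default_max_heap_mb(ram_gb * 1024)
    print(f"{ram_gb} GB RAM -> {heap} MB heap")
```

The 8GB cap in that calculation is one reason 8GB shows up so often as a starting point; the rest of RAM is left to the OS page cache and off-heap structures.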

Q87 How do concurrent_reads and concurrent_writes affect performance?

These define the thread pool sizes for read and write request handling. Setting them too low causes request queuing; too high causes CPU contention.

# cassandra.yaml
concurrent_reads:  32
concurrent_writes: 32
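# Rules of thumb from the cassandra.yaml comments:
#   concurrent_reads  ≈ 16 × number of data drives
#   concurrent_writes ≈ 8 × number of CPU cores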
Monitor

Use nodetool tpstats and watch the Pending column. If pending tasks grow continuously, your thread pools are undersized. Scale horizontally before maxing out thread pools.
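As a starting point for sizing, the comments in cassandra.yaml suggest 16 × the number of data drives for concurrent_reads and 8 × the number of CPU cores for concurrent_writes. The helper below, suggested_pool_sizes, is a hypothetical name that just applies those multipliers; real values should be confirmed against nodetool tpstats under load.

```python
def suggested_pool_sizes(data_drives: int, cpu_cores: int) -> dict[str, int]:
    """Apply the cassandra.yaml rules of thumb for read/write thread pools."""
    return {
        "concurrent_reads": 16 * data_drives,   # reads tend to be disk-bound
        "concurrent_writes": 8 * cpu_cores,     # writes tend to be CPU-bound
    }


# Example host: 2 data drives, 8 cores (illustrative numbers).
print(suggested_pool_sizes(data_drives=2, cpu_cores=8))
```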

Q91 What are signs of GC pressure in Cassandra?

GC pressure is one of the top causes of latency spikes in Cassandra clusters.

Symptoms: GC pauses >200ms in gc.log, heap utilization consistently above 80%, increasing read/write P99 latency, nodetool tpstats showing dropped mutations.

$ tail -f /var/log/cassandra/gc.log
# Look for: [GC pause (G1 Evacuation) X.XXX secs]
# Dangerous: pauses > 1 second
Fix Strategy

Reduce heap if above 8GB, investigate large partitions (they generate heap pressure during reads), tune G1GC, and add nodes to distribute load.

Chapter 6 · Q101–Q120 · Security & Backup

Q101 How do you enable authentication in Cassandra?
# cassandra.yaml (restart the node after changing these)
authenticator: PasswordAuthenticator
authorizer:    CassandraAuthorizer

# Connect via cqlsh with the default superuser
$ cqlsh -u cassandra -p cassandra

-- In cqlsh: immediately change the default password
ALTER ROLE cassandra WITH PASSWORD = 'new_secure_password';
Security

Never leave the default cassandra/cassandra credentials in place. For enterprise deployments, integrate with LDAP or Kerberos for centralized identity management.

Q107 How do you perform a full backup in Cassandra?

Cassandra snapshots use hard links to SSTable files — they’re instant and space-efficient (only diverging data uses additional space).

# Full snapshot of specific keyspace
$ nodetool snapshot -t backup_20250501 my_keyspace

# Snapshots created at:
# /var/lib/cassandra/data/keyspace/table/snapshots/

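# Copy to object storage (this syncs the whole data directory;
# narrow the path to the snapshots/ subdirectories to upload only snapshot files)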
$ aws s3 sync /var/lib/cassandra/data/ s3://my-bucket/cassandra-backup/
Backup Strategy

Combine full weekly snapshots with daily incremental backups. Always back up system_auth and system_schema keyspaces — they contain roles, permissions, and schema definitions.

Q120 What is the difference between nodetool snapshot and incremental backups?

Snapshot — full point-in-time copy via hard links. Fast to create, captures entire table state.

Incremental backup — automatically copies only newly flushed SSTables to a backups/ directory after each memtable flush.

# cassandra.yaml — enable incremental backups
incremental_backups: true
Best Practice

Use both: weekly full snapshot + daily/hourly incremental. Test restores periodically using sstableloader in a staging cluster. A backup you haven’t tested is not a backup.

Chapter 7 · Q121–Q140 · Monitoring & Troubleshooting

Q121 How do you check cluster health in Cassandra?
$ nodetool status

Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID   Rack
UN  10.0.0.1       120.5 GiB  256     33.3%   abc123    rack1
UN  10.0.0.2       118.2 GiB  256     33.3%   def456    rack1
UN  10.0.0.3       121.0 GiB  256     33.4%   ghi789    rack1
Healthy indicators

All nodes show UN (Up/Normal), load is balanced across nodes (no node has dramatically more data), and token ownership is roughly equal. Alert on DN (Down/Normal) or DL (Down/Leaving).
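A health check like this is often scripted. The sketch below parses nodetool status text and lists every node that is not UN; the function name unhealthy_nodes and the sample output (adapted from the listing above, with two nodes changed to DN and UJ for illustration) are assumptions of this example.

```python
import re

SAMPLE_STATUS = """\
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID   Rack
UN  10.0.0.1       120.5 GiB  256     33.3%   abc123    rack1
DN  10.0.0.2       118.2 GiB  256     33.3%   def456    rack1
UJ  10.0.0.3       121.0 GiB  256     33.4%   ghi789    rack1
"""

# Node rows start with a two-letter code: Up/Down, then Normal/Leaving/Joining/Moving.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def unhealthy_nodes(status_output: str) -> list[tuple[str, str]]:
    """Return (address, state) for every node whose state is not UN."""
    problems = []
    for line in status_output.splitlines():
        match = NODE_LINE.match(line)
        if match and match.group(1) != "UN":
            problems.append((match.group(2), match.group(1)))
    return problems


print(unhealthy_nodes(SAMPLE_STATUS))
```

In practice the text would come from running nodetool status via a subprocess or an agent; transitional states such as UJ are expected during planned topology changes and need not page anyone.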

Q130 How do you identify hot partitions?

Hot partitions receive disproportionately more reads/writes than others, causing uneven load and latency spikes on specific nodes.

# Check partition sizes
$ nodetool tablestats keyspace.table

# Look for: Large Partition warning in system.log
# WARNING: Writing large partition key: 45 MB
Fix

Redesign the partition key to include a bucketing component (e.g., append bucket_day or a hash suffix). Enable token-aware routing in your driver to ensure requests go directly to the correct replica.

Q133 How do you detect dropped mutations?

Dropped mutations are write requests that timed out internally — the node couldn’t process them fast enough. This is a serious signal of an overloaded cluster.

$ nodetool tpstats   # output abridged and illustrative

Pool Name              Active  Pending  Blocked
MutationStage               2       45        0  ← growing Pending
ReadStage                   4        3        0

Message type           Dropped
MUTATION                   128  ← Problem!
READ                         0
Causes

Compaction backlog consuming I/O, JVM GC pressure, undersized thread pools, insufficient disk throughput. Investigate in this order: GC → compaction → disk I/O → thread pools.

Chapter 8 · Q141–Q160 · Multi-DC Operations

Q141 How do you configure Cassandra for multi-DC replication?
CREATE KEYSPACE production
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };
Always

Use NetworkTopologyStrategy in production — never SimpleStrategy. Define RF per DC. This allows you to tune replication independently per region and enables graceful DC-level failover.

Q149 How do you handle a full DC outage?

With NetworkTopologyStrategy RF=3 in each DC and LOCAL_QUORUM consistency, a full DC outage is largely transparent to the application, provided clients connect (or fail over) to the surviving DC, which continues serving reads and writes.

# App uses LOCAL_QUORUM — DC2 continues independently
CONSISTENCY LOCAL_QUORUM;
Recovery Steps

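When the DC is restored: 1) If its nodes kept their data, let them rejoin and run nodetool repair to bring in writes missed beyond the hint window. 2) If nodes come back with empty disks, run nodetool rebuild with the surviving DC as the source instead. 3) Verify with nodetool status.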

Q148 Why is clock synchronization critical in Cassandra?

Cassandra resolves write conflicts using timestamps (last-write-wins). If clocks drift between nodes or clients, an older write with a higher timestamp can silently override a newer one.

# Check NTP sync on all nodes
$ chronyc tracking
$ timedatectl status

# Target: <1ms offset across all nodes
Production Rule

Run NTP or Chrony on every Cassandra node AND every application client. Monitor clock drift as a metrics alert. Even 10ms drift can cause subtle data correctness bugs in high-throughput applications.

Chapter 9 · Q161–Q180 · Operations at Scale & Automation

Q162 How do you deploy Cassandra on Kubernetes?

Use StatefulSets with PersistentVolumeClaims — Cassandra nodes require stable network identities and persistent storage that survives pod restarts.

# K8ssandra — production-grade Cassandra on K8s
$ helm repo add k8ssandra https://helm.k8ssandra.io/stable
$ helm install k8ssandra-operator k8ssandra/k8ssandra-operator

# Deploy cluster
$ kubectl apply -f cassandra-cluster.yaml
Anti-affinity

Always configure pod anti-affinity rules so Cassandra pods don’t land on the same physical node — a single hardware failure should never take down multiple replicas.

Q170 How do you perform capacity planning in Cassandra?

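Capacity planning requires understanding your write rate, data growth, and compaction overhead. Compaction temporarily needs extra disk: in the worst case with STCS, free space roughly equal to the size of the SSTables being compacted, which is why the formula below includes headroom.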

Disk formula: raw_data × RF × 1.5 (compaction headroom) / nodes

# Benchmark with cassandra-stress
$ cassandra-stress write n=1000000 -rate threads=50
$ cassandra-stress read n=500000 -rate threads=50
Targets

Disk <70% utilization. CPU <60% average. Alert at 80% disk. Plan 6–12 months ahead. Add nodes before you’re full — rebalancing under pressure is dangerous.
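The disk formula above can be sketched as a small function. The name disk_per_node_gb and the example workload (2 TB of raw data, RF=3, 6 nodes) are illustrative assumptions; the 1.5 factor is the handbook's compaction headroom, not a universal constant.

```python
def disk_per_node_gb(raw_data_gb: float,
                     replication_factor: int,
                     nodes: int,
                     compaction_headroom: float = 1.5) -> float:
    """raw_data × RF × headroom / nodes, per the handbook's formula."""
    return raw_data_gb * replication_factor * compaction_headroom / nodes


# Hypothetical workload: 2,000 GB of raw data, RF=3, 6 nodes.
needed = disk_per_node_gb(raw_data_gb=2000, replication_factor=3, nodes=6)
print(f"{needed:.0f} GB per node")

# Doubling the node count halves the per-node requirement.
print(f"{disk_per_node_gb(2000, 3, 12):.0f} GB per node")
```

Rerunning the calculation with projected data volumes for 6 and 12 months out gives an early signal for when to add nodes.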

Chapter 10 · Q181–Q200 · Case Studies & Real-World Scenarios

Senior interviews often end with system design scenarios. These questions test whether you can translate Cassandra knowledge into real architectural decisions.

Q182 How do you handle IoT sensor data ingestion at scale?

IoT is the canonical Cassandra use case — high write volume, time-ordered data, per-device access patterns, and TTL-based expiry.

CREATE TABLE iot_data (
  device_id TEXT,
  day       DATE,
  ts        TIMESTAMP,
  reading   DOUBLE,
  PRIMARY KEY ((device_id, day), ts)
) WITH
  CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': '1'}
  AND default_time_to_live = 2592000;  -- 30 days TTL, about 30 one-day TWCS windows
Complete Pattern

Device + day bucket prevents hot partitions. TWCS + TTL efficiently expires old sensor data. LOCAL_QUORUM for writes ensures consistency without cross-DC latency. Monitor for device skew — some devices may write 100× more than others.
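When choosing a TWCS window, it helps to count how many windows the TTL spans, since each window ends up as at least one SSTable. The sketch below does that arithmetic; the function name window_count is hypothetical, and the guideline of keeping the total in the low tens is a commonly cited rule of thumb rather than a hard limit.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
TTL_SECONDS = 2_592_000  # the 30-day default_time_to_live


def window_count(ttl_seconds: int, window_seconds: int) -> int:
    """Number of TWCS windows that hold live data for a given TTL (rounded up)."""
    return -(-ttl_seconds // window_seconds)  # ceiling division


print("1-hour windows:", window_count(TTL_SECONDS, SECONDS_PER_HOUR))
print("1-day windows:", window_count(TTL_SECONDS, SECONDS_PER_DAY))
```

For a 30-day TTL, hourly windows produce hundreds of windows to track, while daily windows keep the count at 30.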

Q186 How would you design Cassandra for a chat/messaging app?

Messaging requires two access patterns: “show messages in a conversation” and “show all conversations for a user”. Each needs its own table.

-- Messages in a conversation (main view)
CREATE TABLE messages_by_conversation (
  conversation_id UUID,
  sent_at         TIMESTAMP,
  sender_id       TEXT,
  content         TEXT,
  PRIMARY KEY ((conversation_id), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);

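-- User's conversation list, newest first.
-- conversation_id is part of the key so two conversations with the same
-- last_message_at do not overwrite each other.
CREATE TABLE conversations_by_user (
  user_id         TEXT,
  last_message_at TIMESTAMP,
  conversation_id UUID,
  PRIMARY KEY ((user_id), last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC, conversation_id ASC);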
Design Note

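Write to both tables atomically using a logged BATCH. Because last_message_at is a clustering column, moving a conversation to the top of the list means deleting its old row and inserting a new one. Use driver-side paging for scrollback (never LIMIT loops). TTL deletes data rather than archiving it, so copy old messages to cold storage before relying on TTL to expire them.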

Q200 How do you summarize Cassandra best practices from real-world ops?

The golden rules that distinguish production-hardened Cassandra deployments from brittle ones:

Model for queries, not for relations. Cassandra rewards denormalization.

Always run repairs aligned with gc_grace_seconds. Neglecting repair causes zombie rows and data divergence.

Monitor the three killers: latency (P99), dropped mutations, and compaction backlog.

RF=3 minimum, LOCAL_QUORUM everywhere. Non-negotiable for production HA.

Test your backups. An untested restore is no restore. Run DR drills quarterly.

Key Takeaways from This Handbook

  • Design partition keys to distribute load evenly: prefer high-cardinality keys and avoid partitioning by a time bucket alone
  • Use LOCAL_QUORUM for multi-DC production workloads; it gives you consistency without cross-DC latency
  • Match your compaction strategy to your workload: STCS (write-heavy), LCS (read-heavy), TWCS (TTL/time-series)
  • Never neglect repair — run incremental repairs daily and full repairs weekly, aligned to gc_grace_seconds
  • Keep JVM heap modest (around 8GB, 16GB at most) with G1GC; monitor GC pauses and dropped mutations as your primary health signals
  • For multi-DC, use NetworkTopologyStrategy with RF=3 per DC and plan for graceful failover with LOCAL_QUORUM
  • Test backups by actually restoring them in staging — snapshots + incremental, weekly + daily cadence
