Cassandra Architecture & Operations
Interview Handbook
200 curated interview questions with real-world answers, CQL examples, and best practices — covering everything from data modeling to multi-DC operations at scale.
Preparing for a Cassandra DBA or backend engineer interview can be daunting — the architecture is fundamentally different from relational databases, and interviewers expect you to demonstrate real operational instincts, not just textbook knowledge. This handbook compiles 200 questions across 10 chapters, covering everything you’d encounter in a senior-level interview for roles involving Apache Cassandra.
Overview · Cassandra Ring Architecture
Before diving into questions, here’s a visual overview of Cassandra’s distributed ring architecture — the foundation for most interview topics.
Chapter 1 · Q1–Q20 · Data Modeling
Data modeling is the most frequently tested Cassandra topic. Unlike relational databases, Cassandra requires you to model for your queries — not for normalization. Master partition key design, clustering order, and denormalization strategies.
Avoid raw timestamps as partition keys — they create hotspots where all writes hammer a single node. Use composite keys with time bucketing to distribute load.
CREATE TABLE sensor_data (
    device_id TEXT,
    day DATE,
    ts TIMESTAMP,
    reading DOUBLE,
    PRIMARY KEY ((device_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Best Practice
Keep partitions 10–200MB. Use nodetool tablestats to monitor partition sizes. Descending clustering order serves recent-data queries efficiently.
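As a rough illustration of the time-bucketing idea, the sketch below derives the (device_id, day) composite partition key on the client side. The function name and sample device IDs are hypothetical, and bucketing by UTC day is an assumption; the only point is that readings from the same device on different days land in different partitions.

```python
from datetime import datetime, timezone
from typing import Tuple


def partition_key(device_id: str, ts: datetime) -> Tuple[str, str]:
    """Return the (device_id, day) composite partition key for a reading.

    The day bucket is taken in UTC so every client buckets the same way.
    """
    day = ts.astimezone(timezone.utc).date().isoformat()
    return (device_id, day)


# Hypothetical readings: two days for sensor-1, one day for sensor-2.
readings = [
    ("sensor-1", datetime(2025, 5, 1, 23, 59, tzinfo=timezone.utc)),
    ("sensor-1", datetime(2025, 5, 2, 0, 1, tzinfo=timezone.utc)),
    ("sensor-2", datetime(2025, 5, 1, 12, 0, tzinfo=timezone.utc)),
]

keys = [partition_key(device, ts) for device, ts in readings]
for key in keys:
    print(key)
```

Each distinct key maps to its own partition, so a single busy device is spread over one partition per day instead of growing one unbounded partition.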
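Non-frozen collections can be updated one element at a time, but they are always read in full, and overwriting a whole collection writes a tombstone first. Frozen collections are rewritten in full on every update. Both become expensive for large sets, so use collections only for small metadata.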
CREATE TABLE user_profile (
    user_id TEXT PRIMARY KEY,
    emails SET<TEXT>,
    preferences MAP<TEXT, TEXT>
);
Warning
Model unbounded collections as separate child tables. Never store collections with thousands of items in a single column.
Sequential IDs and raw timestamps concentrate writes on a single node. Use UUIDs combined with time bucketing for even distribution.
CREATE TABLE metrics (
    metric_id UUID,
    bucket_day DATE,
    ts TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((metric_id, bucket_day), ts)
);
Best Practice
Monitor token distribution with nodetool status. Ensure your driver is configured for token-aware load balancing.
Cassandra doesn’t have native UNIQUE constraints. Use Lightweight Transactions (LWT) with IF NOT EXISTS for linearizable inserts.
INSERT INTO users (id, email) VALUES ('u1', 'x@test.com') IF NOT EXISTS;
Performance Impact
LWTs use Paxos consensus and carry a ~4× latency overhead. Use sparingly — only for critical uniqueness constraints like user registration or idempotency keys.
Use two denormalized tables — one per query direction. Accept the duplicate storage cost as the price of scalable reads.
CREATE TABLE student_courses (
    student_id TEXT,
    course_id TEXT,
    PRIMARY KEY ((student_id), course_id)
);
CREATE TABLE course_students (
    course_id TEXT,
    student_id TEXT,
    PRIMARY KEY ((course_id), student_id)
);
Best Practice
Keep dual writes idempotent. Consider batching both inserts with a BATCH statement for atomicity (not performance).
TTL creates tombstones when data expires. High tombstone counts slow reads and stress compaction.
INSERT INTO sessions (id, data) VALUES ('s1', 'abc') USING TTL 3600;
Gotcha
For TTL-heavy workloads (session caches, IoT data), use TimeWindowCompactionStrategy (TWCS) which organizes SSTables by time window and efficiently garbage-collects expired tombstones.
Chapter 2 · Q21–Q40 · Replication & Consistency
Consistency levels are one of the most tested areas. You must understand the trade-off triangle between consistency, availability, and latency — and know exactly when to use each level.
ONE: acknowledges after 1 replica responds. Fastest, but risks stale reads if another replica has newer data.
QUORUM: requires floor(RF/2) + 1 replicas to acknowledge (2 of 3, 3 of 5). Balances consistency and availability. The quorum is counted over the replicas in all DCs, so a request may have to wait on a remote DC.
ALL: all replicas must respond. Strongest consistency, highest latency, and fragile: one downed replica fails the request.
Production Rule
Use LOCAL_QUORUM for multi-DC deployments. It restricts acknowledgment to the local DC, avoiding cross-DC latency while maintaining quorum consistency.
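The quorum arithmetic behind these levels is easy to check by hand. The sketch below is only an illustration: the function names and the two-DC layout are hypothetical, and the overlap rule R + W > RF is the standard condition for a read to intersect the most recent acknowledged write.

```python
def quorum(replicas: int) -> int:
    """Replicas that must respond for a quorum: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


def reads_see_latest_write(read_acks: int, write_acks: int, rf: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return read_acks + write_acks > rf


rf_per_dc = {"dc1": 3, "dc2": 3}  # hypothetical two-DC keyspace
total_rf = sum(rf_per_dc.values())

print("QUORUM needs", quorum(total_rf), "of", total_rf, "replicas across all DCs")
print("LOCAL_QUORUM needs", quorum(rf_per_dc["dc1"]), "of", rf_per_dc["dc1"], "replicas in dc1")
print("QUORUM read + QUORUM write at RF=3 overlap:", reads_see_latest_write(2, 2, 3))
print("ONE read + ONE write at RF=3 overlap:", reads_see_latest_write(1, 1, 3))
```

With RF=3 in each of two DCs, QUORUM has to collect 4 acknowledgments and therefore always involves a remote DC, while LOCAL_QUORUM needs only 2 local ones.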
When a target replica is temporarily down, the coordinator stores a “hint” and replays the write once the node recovers. This preserves write availability during short outages.
# cassandra.yaml
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
Disable when
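Coordinators stop storing hints for a node once it has been down longer than max_hint_window_in_ms (3 hours here), so hints alone cannot bring it back in sync after a long outage; run a repair once it rejoins. Consider disabling hinted handoff if replaying accumulated hints overwhelms nodes on recovery.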
LWTs use the Paxos protocol to provide linearizable consistency — a stronger guarantee than Cassandra’s eventual consistency model.
INSERT INTO users (id, email) VALUES ('u1', 'x@test.com') IF NOT EXISTS;
-- Result:
-- [applied] = true (success)
-- [applied] = false (row already exists)
Performance
LWTs incur ~4× latency overhead vs regular writes. Reserve for registration, idempotency keys, and financial transactions where exactly-once semantics are critical.
Cassandra resolves conflicting writes with last-write-wins (LWW) on each cell's write timestamp, which modern drivers generate client-side by default. The write with the highest timestamp wins, regardless of which replica received it first.
Clock Skew Risk
If client clocks are out of sync (NTP drift), an older write stamped by a fast clock can override a newer write. Keep all nodes and clients synchronized with NTP/Chrony and aim for <1ms drift.
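The effect of skew on last-write-wins can be reproduced with a small simulation. This is a simplified sketch, not Cassandra's actual reconciliation code: the Write class, the 50 ms skew, and the tie handling (the second argument wins on equal timestamps) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # write timestamp in microseconds, as stamped by the client


def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the higher timestamp.

    Simplification: on equal timestamps this sketch just returns b.
    """
    return a if a.timestamp_us > b.timestamp_us else b


SKEW_US = 50_000  # hypothetical client whose clock runs 50 ms fast

# True order of events: "old" is written first, "new" 10 ms later.
old = Write("old", timestamp_us=1_000_000 + SKEW_US)  # stamped by the fast clock
new = Write("new", timestamp_us=1_010_000)            # stamped by an accurate clock

winner = resolve(old, new)
print("value kept after reconciliation:", winner.value)
```

Although "new" was written later in real time, "old" carries the higher timestamp and is the value that survives.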
Chapter 3 · Q41–Q60 · Compaction & Repairs
Compaction and repair are the operational heart of Cassandra. Many production incidents trace back to misconfigured compaction strategies or neglected repairs. Know the three strategies cold.
TWCS organizes SSTables into fixed time windows. SSTables within a window are merged together, and old windows are rarely touched once compacted. Once all the data in an old window's SSTable has passed its TTL, the whole SSTable is dropped without being rewritten.
ALTER TABLE iot_readings WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};
Use When
Ideal for IoT sensor data, session caches, application logs, and any time-series data where you use TTL. Avoids the tombstone accumulation problems that STCS/LCS suffer with TTL.
Tombstones are deletion markers. During a read, Cassandra must scan through all tombstones to find live data. High tombstone counts cause read timeouts and trigger TombstoneOverwhelmingException.
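Watch Cassandra logs for: Read X live rows and Y tombstone cells.... By default a read logs a warning at 1,000 tombstones (tombstone_warn_threshold) and is aborted at 100,000 (tombstone_failure_threshold), so anything approaching 100,000 tombstones in a single read is a serious problem.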
# Check tombstone stats
$ nodetool tablestats keyspace.table
Best Practices
Minimize frequent deletes, avoid large TTLs on wide tables, use TWCS for TTL-heavy workloads, and keep gc_grace_seconds tuned to your repair schedule.
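To act on the tombstone log message quoted above, a small parser can extract the two counts and flag noisy reads. This is a sketch under assumptions: the sample log line is made up, only the "Read X live rows and Y tombstone cells" fragment is taken from the text, and the 1,000 warning level mirrors the default tombstone_warn_threshold.

```python
import re
from typing import Optional, Tuple

# Matches the fragment quoted in the text; the rest of the line is ignored.
TOMBSTONE_RE = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")

WARN_THRESHOLD = 1_000  # assumed to match the default tombstone_warn_threshold


def parse_tombstone_line(line: str) -> Optional[Tuple[int, int]]:
    """Return (live_rows, tombstone_cells), or None if the line does not match."""
    match = TOMBSTONE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical log line for illustration only.
sample = "WARN Read 12 live rows and 4500 tombstone cells for query SELECT * FROM ks.tbl"

parsed = parse_tombstone_line(sample)
if parsed is not None:
    live, tombstones = parsed
    status = "over" if tombstones > WARN_THRESHOLD else "under"
    print(f"{tombstones} tombstones for {live} live rows ({status} warn threshold)")
```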
gc_grace_seconds defines how long Cassandra waits before purging tombstones during compaction. Default is 10 days.
If a node is down when a delete happens, it never sees the tombstone. When it comes back, it may “resurrect” deleted rows — called zombie rows. The gc_grace_seconds window gives you time to run a repair before tombstones are garbage collected.
Always ensure gc_grace_seconds ≥ your repair interval. If you repair weekly, keep gc_grace at 10 days minimum.
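The relationship between repair cadence and gc_grace_seconds can be expressed as a simple check. The helper names below are hypothetical, and the sketch ignores how long a repair run itself takes, which in practice eats into the margin.

```python
SECONDS_PER_DAY = 86_400
DEFAULT_GC_GRACE_SECONDS = 10 * SECONDS_PER_DAY  # 864000, the 10-day default


def repair_margin_days(repair_interval_days: float,
                       gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> float:
    """Days of slack between the repair interval and tombstone purging."""
    return gc_grace_seconds / SECONDS_PER_DAY - repair_interval_days


def repair_schedule_is_safe(repair_interval_days: float,
                            gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True when every replica is repaired before tombstones can be purged."""
    return repair_margin_days(repair_interval_days, gc_grace_seconds) >= 0


for interval in (7, 10, 14):
    print(f"repair every {interval} days:",
          "safe" if repair_schedule_is_safe(interval) else "risk of zombie rows")
```

A weekly repair leaves three days of slack against the default; a two-week interval would need gc_grace_seconds raised to at least 14 days.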
Chapter 4 · Q61–Q80 · Cluster Management
Operators are expected to bootstrap nodes safely, decommission without data loss, and manage topology changes with minimal impact. These questions probe your hands-on operational knowledge.
Bootstrap is the process of a new node joining the cluster and receiving its share of data:
1. Node contacts seed nodes and joins the gossip ring.
2. Token assignment (via vnodes or manual).
3. Streaming begins from existing replicas.
4. Node transitions to UN (Up/Normal).
$ nodetool status
# UN = Up Normal (healthy)
# UJ = Up Joining (streaming)
# DL = Down Leaving (decommission)
Monitor with
Use nodetool netstats to watch streaming progress during bootstrap. Never interrupt bootstrap — it leaves orphaned data.
Decommission streams all data that the node owns to its remaining replicas, then removes it from the ring cleanly.
# Step 1: Run repair first
$ nodetool repair
# Step 2: Decommission
$ nodetool decommission
# Monitor progress
$ nodetool netstats
Never do
Don’t decommission multiple nodes from the same DC simultaneously — you risk dropping below replication factor and losing data. Always wait for the first node to fully decommission before starting the next.
Virtual nodes (vnodes) assign multiple token ranges to each physical node, rather than one contiguous range. The default was 256 tokens per node before Cassandra 4.0; 4.0 and later default to 16.
# cassandra.yaml
num_tokens: 256 # default before Cassandra 4.0 (4.0+ defaults to 16)
# For large clusters, reduce to 16-32 instead:
# num_tokens: 16
Benefits
With vnodes: adding/removing a node automatically redistributes data evenly across the ring, with no manual token calculation. Reduces strain during bootstrap. For very large clusters (100+ nodes), reduce to 16–32 tokens to improve repair speed.
Chapter 5 · Q81–Q100 · Performance Tuning
Performance tuning interviews go deep: JVM heap, GC strategy, compaction throughput, thread pools, and read/write path optimization. Know your nodetool commands and what each metric means.
Heap size directly impacts GC pause duration. The sweet spot is 8GB — large enough for caches but small enough to keep GC pauses below 500ms.
# jvm.options
-Xms8G
-Xmx8G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=300
Rules
Heap ≤ 50% of RAM. Never exceed 16GB — larger heaps cause longer GC pauses. Use G1GC (default in newer JDKs). Enable GC logging and monitor pause times.
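The sizing rules above reduce to taking the smallest of three limits. The sketch below encodes them as stated in this section (8 GB target, at most half of RAM, 16 GB hard cap); the function names are hypothetical, and the result is a starting point to validate against GC logs rather than a definitive setting.

```python
from typing import List


def suggested_heap_mb(ram_gb: float, target_gb: float = 8.0, cap_gb: float = 16.0) -> int:
    """Heap in MB: the smallest of the target, half of RAM, and the hard cap."""
    heap_gb = min(target_gb, ram_gb * 0.5, cap_gb)
    return int(heap_gb * 1024)


def jvm_heap_flags(ram_gb: float) -> List[str]:
    """Matching -Xms/-Xmx flags; both are set equal to avoid heap resizing."""
    heap_mb = suggested_heap_mb(ram_gb)
    return [f"-Xms{heap_mb}M", f"-Xmx{heap_mb}M"]


for ram in (8, 32, 128):
    print(f"{ram} GB RAM ->", " ".join(jvm_heap_flags(ram)))
```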
These define the thread pool sizes for read and write request handling. Setting them too low causes request queuing; too high causes CPU contention.
# cassandra.yaml
concurrent_reads: 32 # rule of thumb: 16 x number of data drives
concurrent_writes: 32 # rule of thumb: 8 x number of CPU cores
Monitor
Use nodetool tpstats and watch the Pending column. If pending tasks grow continuously, your thread pools are undersized. Scale horizontally before maxing out thread pools.
GC pressure is one of the top causes of latency spikes in Cassandra clusters.
Symptoms: GC pauses >200ms in gc.log, heap utilization consistently above 80%, increasing read/write P99 latency, nodetool tpstats showing dropped mutations.
$ tail -f /var/log/cassandra/gc.log
# Look for: [GC pause (G1 Evacuation) X.XXX secs]
# Dangerous: pauses > 1 second
Fix Strategy
Reduce heap if above 8GB, investigate large partitions (they generate heap pressure during reads), tune G1GC, and add nodes to distribute load.
Chapter 6 · Q101–Q120 · Security & Backup
# cassandra.yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
# Connect via cqlsh
$ cqlsh -u cassandra -p cassandra
# Immediately change default password
ALTER ROLE cassandra WITH PASSWORD = 'new_secure_password';
Security
Never leave the default cassandra/cassandra credentials in place. For enterprise deployments, integrate with LDAP or Kerberos for centralized identity management.
Cassandra snapshots use hard links to SSTable files — they’re instant and space-efficient (only diverging data uses additional space).
# Full snapshot of specific keyspace
$ nodetool snapshot -t backup_20250501 my_keyspace
# Snapshots created at:
# /var/lib/cassandra/data/keyspace/table/snapshots/
# Copy to object storage
$ aws s3 sync /var/lib/cassandra/data/ s3://my-bucket/cassandra-backup/
Backup Strategy
Combine full weekly snapshots with daily incremental backups. Always back up system_auth and system_schema keyspaces — they contain roles, permissions, and schema definitions.
Snapshot — full point-in-time copy via hard links. Fast to create, captures entire table state.
Incremental backup — automatically copies only newly flushed SSTables to a backups/ directory after each memtable flush.
# cassandra.yaml: enable incremental backups
incremental_backups: true
Best Practice
Use both: weekly full snapshot + daily/hourly incremental. Test restores periodically using sstableloader in a staging cluster. A backup you haven’t tested is not a backup.
Chapter 7 · Q121–Q140 · Monitoring & Troubleshooting
$ nodetool status
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID  Rack
UN  10.0.0.1  120.5 GiB  256     33.3%  abc123   rack1
UN  10.0.0.2  118.2 GiB  256     33.3%  def456   rack1
UN  10.0.0.3  121.0 GiB  256     33.4%  ghi789   rack1
Healthy indicators
All nodes show UN (Up/Normal), load is balanced across nodes (no node has dramatically more data), and token ownership is roughly equal. Alert on DN (Down/Normal) or DL (Down/Leaving).
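A health check along these lines can be scripted by scanning the status output for any state other than UN. The sketch below works on captured text rather than calling nodetool, and the sample is a modified copy of the output above with one node changed to DN for illustration; the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Sample modeled on the output above; the third node is set to DN for the demo.
SAMPLE_STATUS = """\
Datacenter: dc1
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns   Host ID  Rack
UN  10.0.0.1  120.5 GiB  256     33.3%  abc123   rack1
UN  10.0.0.2  118.2 GiB  256     33.3%  def456   rack1
DN  10.0.0.3  121.0 GiB  256     33.4%  ghi789   rack1
"""

# Node rows start with a two-letter code: Up/Down, then Normal/Leaving/Joining/Moving.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def unhealthy_nodes(status_output: str) -> List[Tuple[str, str]]:
    """Return (address, state) for every node whose state is not UN."""
    flagged = []
    for line in status_output.splitlines():
        match = NODE_ROW.match(line)
        if match and match.group(1) != "UN":
            flagged.append((match.group(2), match.group(1)))
    return flagged


for address, state in unhealthy_nodes(SAMPLE_STATUS):
    print(f"ALERT: {address} is {state}")
```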
Hot partitions receive disproportionately more reads/writes than others, causing uneven load and latency spikes on specific nodes.
# Check partition sizes
$ nodetool tablestats keyspace.table
# Look for: Large Partition warning in system.log
# WARNING: Writing large partition key: 45 MB
Fix
Redesign the partition key to include a bucketing component (e.g., append bucket_day or a hash suffix). Enable token-aware routing in your driver to ensure requests go directly to the correct replica.
Dropped mutations are write requests that timed out internally — the node couldn’t process them fast enough. This is a serious signal of an overloaded cluster.
$ nodetool tpstats
Pool Name      Active  Pending  Blocked  Dropped
MutationStage  2       45       0        128      ← Problem!
ReadStage      4       3        0        0
Causes
Compaction backlog consuming I/O, JVM GC pressure, undersized thread pools, insufficient disk throughput. Investigate in this order: GC → compaction → disk I/O → thread pools.
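Spotting the problem rows in thread pool statistics is straightforward to automate. The sketch below parses the simplified five-column layout shown above, which is an assumption (real nodetool tpstats output has more columns and reports dropped messages in a separate section); the pending threshold of 10 is likewise a hypothetical alert level.

```python
from typing import Dict

# Simplified layout copied from the example above, without the annotation.
SAMPLE_TPSTATS = """\
Pool Name      Active  Pending  Blocked  Dropped
MutationStage  2       45       0        128
ReadStage      4       3        0        0
"""


def overloaded_pools(tpstats_output: str, max_pending: int = 10) -> Dict[str, Dict[str, int]]:
    """Return pools with dropped tasks or a pending queue above max_pending."""
    flagged = {}
    for line in tpstats_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 5:
            continue  # not a data row in the simplified layout
        name, _active, pending, _blocked, dropped = parts
        pending_count, dropped_count = int(pending), int(dropped)
        if dropped_count > 0 or pending_count > max_pending:
            flagged[name] = {"pending": pending_count, "dropped": dropped_count}
    return flagged


for pool, stats in overloaded_pools(SAMPLE_TPSTATS).items():
    print(f"{pool}: pending={stats['pending']} dropped={stats['dropped']}")
```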
Chapter 8 · Q141–Q160 · Multi-DC Operations
CREATE KEYSPACE production WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
};
Always
Use NetworkTopologyStrategy in production — never SimpleStrategy. Define RF per DC. This allows you to tune replication independently per region and enables graceful DC-level failover.
With NetworkTopologyStrategy RF=3 in each DC and LOCAL_QUORUM consistency, the surviving DC keeps serving reads and writes through a full DC outage. The outage is transparent only to clients that already connect to, or fail over to, the surviving DC.
# App uses LOCAL_QUORUM; DC2 continues independently
CONSISTENCY LOCAL_QUORUM;
Recovery Steps
When the DC is restored:
1. If its nodes lost their data, run nodetool rebuild to stream it from the surviving DC.
2. Run nodetool repair to ensure full consistency.
3. Verify with nodetool status.
Cassandra resolves write conflicts using timestamps (last-write-wins). If clocks drift between nodes or clients, an older write with a higher timestamp can silently override a newer one.
# Check NTP sync on all nodes
$ chronyc tracking
$ timedatectl status
# Target: <1ms offset across all nodes
Production Rule
Run NTP or Chrony on every Cassandra node AND every application client. Monitor clock drift as a metrics alert. Even 10ms drift can cause subtle data correctness bugs in high-throughput applications.
Chapter 9 · Q161–Q180 · Operations at Scale & Automation
Use StatefulSets with PersistentVolumeClaims — Cassandra nodes require stable network identities and persistent storage that survives pod restarts.
# K8ssandra: production-grade Cassandra on K8s
$ helm repo add k8ssandra https://helm.k8ssandra.io/stable
$ helm install k8ssandra-operator k8ssandra/k8ssandra-operator
# Deploy cluster
$ kubectl apply -f cassandra-cluster.yaml
Anti-affinity
Always configure pod anti-affinity rules so Cassandra pods don’t land on the same physical node — a single hardware failure should never take down multiple replicas.
Capacity planning requires understanding your write rate, data growth, and compaction overhead — Cassandra temporarily needs 2× disk during compaction.
Disk formula: raw_data × RF × 1.5 (compaction headroom) / nodes
# Benchmark with cassandra-stress
$ cassandra-stress write n=1000000 -rate threads=50
$ cassandra-stress read n=500000 -rate threads=50
Targets
Disk <70% utilization. CPU <60% average. Alert at 80% disk. Plan 6–12 months ahead. Add nodes before you’re full — rebalancing under pressure is dangerous.
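The disk formula given earlier in this section (raw_data × RF × 1.5 / nodes) can be turned into a quick calculator. The function names and the example figures (2 TB of raw data, 1 TB usable per node) are hypothetical; the 1.5 headroom factor is the one stated in the formula.

```python
import math


def disk_per_node_gb(raw_data_gb: float, rf: int, nodes: int,
                     compaction_headroom: float = 1.5) -> float:
    """Disk needed on each node: raw_data x RF x headroom / nodes."""
    return raw_data_gb * rf * compaction_headroom / nodes


def nodes_needed(raw_data_gb: float, rf: int, usable_disk_per_node_gb: float,
                 compaction_headroom: float = 1.5) -> int:
    """Smallest node count whose combined disk covers the formula's total."""
    total_gb = raw_data_gb * rf * compaction_headroom
    return math.ceil(total_gb / usable_disk_per_node_gb)


# Hypothetical workload: 2 TB of raw data at RF=3.
print("per node on 6 nodes:", disk_per_node_gb(2000, 3, 6), "GB")
print("nodes needed with 1 TB usable each:", nodes_needed(2000, 3, 1000))
```

Rerunning the calculation with projected data growth for the next 6 to 12 months shows when the next nodes have to be added.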
Chapter 10 · Q181–Q200 · Case Studies & Real-World Scenarios
Senior interviews often end with system design scenarios. These questions test whether you can translate Cassandra knowledge into real architectural decisions.
IoT is the canonical Cassandra use case — high write volume, time-ordered data, per-device access patterns, and TTL-based expiry.
-- Daily TWCS windows give about 30 windows over the 30-day TTL
CREATE TABLE iot_data (
    device_id TEXT,
    day DATE,
    ts TIMESTAMP,
    reading DOUBLE,
    PRIMARY KEY ((device_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  }
  AND default_time_to_live = 2592000; -- 30 days TTL
Complete Pattern
Device + day bucket prevents hot partitions. TWCS + TTL efficiently expires old sensor data. LOCAL_QUORUM for writes ensures consistency without cross-DC latency. Monitor for device skew — some devices may write 100× more than others.
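When pairing TWCS with a TTL, it helps to count how many time windows (and therefore at least how many SSTables) stay live until data expires. The sketch below does that arithmetic for the 30-day TTL used here; the function name is hypothetical, and the commonly cited guidance of keeping the count to a few dozen windows is an assumption not stated in this handbook.

```python
import math

UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3_600, "DAYS": 86_400}


def twcs_window_count(ttl_seconds: int, window_size: int, window_unit: str) -> int:
    """Number of compaction windows that hold unexpired data for a given TTL."""
    window_seconds = window_size * UNIT_SECONDS[window_unit]
    return math.ceil(ttl_seconds / window_seconds)


TTL_30_DAYS = 2_592_000  # default_time_to_live from the table definition

print("1-hour windows:", twcs_window_count(TTL_30_DAYS, 1, "HOURS"))
print("1-day windows: ", twcs_window_count(TTL_30_DAYS, 1, "DAYS"))
```

Hourly windows keep hundreds of windows alive for a 30-day TTL, while daily windows keep about thirty, which is why the window size is usually chosen relative to the TTL.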
Messaging requires two access patterns: “show messages in a conversation” and “show all conversations for a user”. Each needs its own table.
-- Messages in a conversation (main view)
CREATE TABLE messages_by_conversation (
    conversation_id UUID,
    sent_at TIMESTAMP,
    sender_id TEXT,
    content TEXT,
    PRIMARY KEY ((conversation_id), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);

-- User's conversation list
-- conversation_id is a clustering column so that two conversations
-- updated in the same millisecond do not overwrite each other
CREATE TABLE conversations_by_user (
    user_id TEXT,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    PRIMARY KEY ((user_id), last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC, conversation_id ASC);
Design Note
Write to both tables atomically using a logged BATCH. Use driver-side paging for scrollback (never LIMIT loops). Apply a TTL so old messages expire once they have been archived to cold storage; TTL only deletes data, it does not move it.
The golden rules that distinguish production-hardened Cassandra deployments from brittle ones:
① Model for queries, not for relations. Cassandra rewards denormalization.
② Always run repairs aligned with gc_grace_seconds. Neglecting repair causes zombie rows and data divergence.
③ Monitor the three killers: latency (P99), dropped mutations, and compaction backlog.
④ RF=3 minimum, LOCAL_QUORUM everywhere. Non-negotiable for production HA.
⑤ Test your backups. An untested restore is no restore. Run DR drills quarterly.
Key Takeaways from This Handbook
- Design partition keys to distribute load evenly — avoid sequential IDs and raw timestamps as partition keys
- Use LOCAL_QUORUM for multi-DC production workloads; it gives you consistency without cross-DC latency
- Match your compaction strategy to your workload: STCS (write-heavy), LCS (read-heavy), TWCS (TTL/time-series)
- Never neglect repair — run incremental repairs daily and full repairs weekly, aligned to gc_grace_seconds
- Keep JVM heap at 8GB max with G1GC; monitor GC pauses and dropped mutations as your primary health signals
- For multi-DC, use NetworkTopologyStrategy with RF=3 per DC and plan for graceful failover with LOCAL_QUORUM
- Test backups by actually restoring them in staging — snapshots + incremental, weekly + daily cadence