ZOOKEEPER-5051: Avoid NPE closing 4lw connection during startup #2391

Open: PAC-MAJ wants to merge 1 commit into apache:master from PAC-MAJ:ZOOKEEPER-5051-4lw-startup-remove-cnxn-npe

PAC-MAJ commented May 20, 2026

Problem

When a 4-letter command is sent while the ZooKeeper server is starting,
the connection close path can call ZooKeeperServer.removeCnxn() before
startdata() has initialized zkDb.

This can trigger:

java.lang.NullPointerException: Cannot invoke
"org.apache.zookeeper.server.ZKDatabase.removeCnxn(...)"
because "this.zkDb" is null

Because the exception interrupts the close path, a client waiting for the connection to close can also be left hanging.
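
To make the failure mode concrete, here is a minimal sketch using stand-in classes, not the real ZooKeeper code. `Database`, `Server`, and `Connection` are hypothetical names, and the ordering inside `Connection.close()` (server-side cleanup first, socket close second) is an assumption made for illustration. It shows how an NPE thrown from an unguarded `removeCnxn()` can prevent the later close step from running.

```java
// Illustrative stand-ins only; none of these are the real ZooKeeper classes.
class StartupCloseRaceSketch {

    /** Hypothetical stand-in for the database the server delegates to. */
    static class Database {
        void removeCnxn(Object cnxn) {
            // Per-connection cleanup would go here.
        }
    }

    /** Hypothetical stand-in for the server. */
    static class Server {
        Database zkDb; // stays null until startdata() is called

        void startdata() {
            zkDb = new Database();
        }

        // Unguarded delegation: dereferences zkDb even when it is still null.
        void removeCnxn(Object cnxn) {
            zkDb.removeCnxn(cnxn);
        }
    }

    /** Hypothetical stand-in for a 4lw command connection. */
    static class Connection {
        boolean socketClosed = false;

        // Assumed ordering: server-side cleanup first, socket close second.
        void close(Server server) {
            server.removeCnxn(this);
            socketClosed = true; // never reached if removeCnxn() throws
        }
    }

    public static void main(String[] args) {
        Server server = new Server(); // startdata() has not been called yet
        Connection cnxn = new Connection();
        try {
            cnxn.close(server);
        } catch (NullPointerException e) {
            // prints "close aborted by NPE; socketClosed=false"
            System.out.println("close aborted by NPE; socketClosed=" + cnxn.socketClosed);
        }
    }
}
```

In this sketch the socket stays open after the exception, which is one way a client could end up waiting indefinitely for the close.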

Fix

Guard ZooKeeperServer.removeCnxn() so it only delegates to zkDb
when zkDb has already been initialized.
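
The shape of the guard can be sketched as follows. This is not the actual patch: `Database` and `Server` are hypothetical stand-ins for `ZKDatabase` and `ZooKeeperServer`, and declaring the field `volatile` and copying it into a local variable are choices made for this sketch only.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-ins only; these are NOT the real ZooKeeper classes.
class RemoveCnxnGuardSketch {

    /** Hypothetical stand-in for ZKDatabase: tracks the connections it knows about. */
    static class Database {
        private final Set<Object> cnxns = new HashSet<>();

        void addCnxn(Object cnxn) {
            cnxns.add(cnxn);
        }

        void removeCnxn(Object cnxn) {
            cnxns.remove(cnxn);
        }

        int size() {
            return cnxns.size();
        }
    }

    /** Hypothetical stand-in for ZooKeeperServer. */
    static class Server {
        // Null until startdata() runs, mirroring the startup window described above.
        volatile Database zkDb;

        void startdata() {
            zkDb = new Database();
        }

        void removeCnxn(Object cnxn) {
            // Read the field once so the null check and the call see the same value.
            Database db = zkDb;
            if (db != null) {
                db.removeCnxn(cnxn);
            }
            // If db is null, the database has not been created yet,
            // so it holds no state for this connection.
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        Object cnxn = new Object();

        server.removeCnxn(cnxn); // before startdata(): returns without throwing

        server.startdata();
        server.zkDb.addCnxn(cnxn);
        server.removeCnxn(cnxn); // after startdata(): delegates to the database
        System.out.println("tracked connections after removal: " + server.zkDb.size());
    }
}
```

The early return is safe in this sketch because a database that does not exist yet cannot hold any per-connection state to clean up.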

Test

Added ZooKeeperServerTest#testRemoveCnxnBeforeStartData.
The test creates a ZooKeeperServer but intentionally does not call
startdata(), then verifies that removeCnxn() does not throw.

PAC-MAJ force-pushed the ZOOKEEPER-5051-4lw-startup-remove-cnxn-npe branch 3 times, most recently from 6f17b3e to fa9f51d (May 20, 2026 13:21)
PAC-MAJ force-pushed the ZOOKEEPER-5051-4lw-startup-remove-cnxn-npe branch 4 times, most recently from 8ace05d to 89dae72 (May 25, 2026 19:36)

PAC-MAJ commented Jun 1, 2026

Hi ZooKeeper maintainers,

Gentle ping on this PR for ZOOKEEPER-5051.

This fixes a startup race where a 4lw command connection can be closed before ZooKeeperServer.startdata() initializes zkDb, causing ZooKeeperServer.removeCnxn() to throw an NPE and potentially leave the client connection hanging.

The change is intentionally small: guard removeCnxn() when zkDb is not initialized yet. I also added a deterministic regression test for the pre-startdata case.

The previous precommit failure looked like a Jenkins infrastructure/agent failure during result collection rather than a test failure. Could someone please rerun CI, or advise if anything else is needed from my side?

Thanks!
