disagg: Fix unexpected object storage usage caused by pre-lock residue (#10760) #10767
Signed-off-by: ti-chi-bot <[email protected]>
This is an automated cherry-pick of #10760
What problem does this PR solve?
Issue Number: close #10763
Problem Summary:
- `PageDirectory` write-group semantics could cause follower writers to miss their own applied lock-id cleanup signals.
- `S3LockLocalManager.pre_lock_keys` could remain resident and be repeatedly written into manifest locks.
What is changed and how it works?
End-to-end correctness fixes for lock lifecycle
- `PageDirectory::apply` now returns writer-scoped `applied_data_files` for both the write-group owner and followers, so each writer gets its own cleanup signal.
- `UniversalPageStorage::write` uses those per-writer ids to clean pre-locks reliably after apply.
- `cleanPreLockKeysOnWriteFailure(...)` is invoked when remote write/apply fails.
- `createS3LockForWriteBatch` was adjusted to avoid partial pre-lock residue on partial lock-creation failures (append to `pre_lock_keys` after the lock-creation pass), and its return value is now aligned with "newly appended keys" semantics.
Test coverage and regression guards
- `PageDirectory` and `UniversalPageStorage` paths.
- `S3LockLocalManager` tests for partial cleanup, failure cleanup, lock-return semantics, and partial-failure atomicity.
- `std::launch::async` to avoid deferred scheduling risk.
Observability and operations improvements
- `S3GCManagerService`.
- `tiflash_storage_s3_store_summary_bytes{store_id, type=data_file_bytes|dt_file_bytes}`
- `remote_summary_interval_seconds` and wired it through `TMTContext`; `<= 0` disables periodic summary task registration.
Check List
Tests
Side effects
Documentation
Release note
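For reviewers who want a compact mental model of the lock-lifecycle fix, here is a minimal standalone sketch. It is not the TiFlash code: `PreLockSketch`, `createLocksForWriteBatch`, `cleanAppliedKeys`, `residentCount`, `runWriters`, and the `failing_key` parameter are hypothetical names invented for illustration. It assumes only what the description states: keys are appended to `pre_lock_keys` after the lock-creation pass, the return value is the newly appended keys, and each writer cleans up with its own ids. The writers are started with `std::launch::async` so they really run concurrently instead of being deferred.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the pre-lock bookkeeping; not the real S3LockLocalManager.
class PreLockSketch
{
public:
    // Pass 1: try to create a lock for every key (failure is simulated by failing_key).
    // Pass 2: only if pass 1 fully succeeded, append the keys to pre_lock_keys.
    // Returns the keys newly appended by this call; empty on failure.
    std::vector<std::string> createLocksForWriteBatch(
        const std::vector<std::string> & keys,
        const std::string & failing_key = "")
    {
        std::vector<std::string> created;
        for (const auto & key : keys)
        {
            if (key == failing_key)
                return {}; // partial failure: nothing has touched pre_lock_keys yet
            created.push_back(key);
        }

        std::lock_guard<std::mutex> guard(mtx);
        std::vector<std::string> appended;
        for (const auto & key : created)
        {
            if (pre_lock_keys.insert(key).second) // true only for keys not already resident
                appended.push_back(key);
        }
        return appended;
    }

    // Per-writer cleanup: erase exactly the keys this writer reports as applied.
    void cleanAppliedKeys(const std::vector<std::string> & applied)
    {
        std::lock_guard<std::mutex> guard(mtx);
        for (const auto & key : applied)
            pre_lock_keys.erase(key);
    }

    std::size_t residentCount() const
    {
        std::lock_guard<std::mutex> guard(mtx);
        return pre_lock_keys.size();
    }

private:
    mutable std::mutex mtx;
    std::unordered_set<std::string> pre_lock_keys;
};

// Runs `writers` concurrent writers; each one cleans only the keys it appended itself.
// Returns the number of keys still resident after all writers finish.
inline std::size_t runWriters(PreLockSketch & mgr, int writers)
{
    assert(writers >= 0);
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < writers; ++i)
    {
        // Explicit std::launch::async: the default policy may defer the task until get().
        tasks.push_back(std::async(std::launch::async, [&mgr, i] {
            const std::string prefix = "w" + std::to_string(i);
            auto mine = mgr.createLocksForWriteBatch({prefix + "/a.lock", prefix + "/b.lock"});
            mgr.cleanAppliedKeys(mine); // writer-scoped cleanup signal
        }));
    }
    for (auto & task : tasks)
        task.get();
    return mgr.residentCount();
}
```

In this sketch, a batch that fails midway leaves `residentCount()` unchanged, and once every writer has cleaned its own keys the set is empty again. Those are the two properties the regression tests above are described as guarding.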