Fix: words containing hyphens in a custom dictionary were being split #102
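Problem description:
- When a word containing a hyphen (e.g. BZ-YQ1722) is added through a custom dictionary
- The word is always split (e.g. into BZ and YQ1722), no matter what frequency is set

Root cause:
- The regex patterns do not include the hyphen `-`
- The word is already separated during the text chunking stage, so later processing cannot recognize the complete word

Fix:
- Add \- to re_han_with_ascii_pattern to support hyphens
- Add \- to re_ascii_pattern to support hyphens
- Add \- to re_skip_pattern (cut_all mode) to support hyphens
- Add the HyphenWordsTest.php test cases to verify the fix

Impact:
- Custom words containing hyphens are now handled correctly
- Both addWord() and loadUserDict() are supported
- Existing functionality is not affected

Fixes #99

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>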
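The root cause described above can be reproduced in isolation. The following is a minimal sketch, assuming two simplified character classes that stand in for the library's chunking patterns; the patterns and the `chunks()` helper are hypothetical and are not copied from jieba-php.

```php
<?php
// Hypothetical, simplified chunking patterns (NOT the library's real ones).
// The only difference between them is the escaped hyphen in the class.
$withoutHyphen = '/[a-zA-Z0-9]+/';
$withHyphen    = '/[a-zA-Z0-9\-]+/';

/**
 * Return every chunk of $text matched by $pattern, in order.
 * Illustrative helper; the name is an assumption.
 */
function chunks(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = 'BZ-YQ1722';

// Without the hyphen in the class, the word is cut at the chunking stage,
// so a later dictionary lookup never sees the full string.
$before = chunks($withoutHyphen, $text);

// With the hyphen included, the whole word survives as one chunk.
$after = chunks($withHyphen, $text);

print_r($before);
print_r($after);
```

Because the split happens before any dictionary lookup, raising the word's frequency cannot help in the first case, which matches the behavior reported in the issue.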
Code Review - PR #102: Fix Hyphenated Words Segmentation
Summary
This PR addresses issue #99 by adding hyphen support to regex patterns.
✅ Positive Aspects
1. Excellent Test Coverage
2. Correct Root Cause Analysis
3. Clean Implementation
@claude The tests did not pass. Can you fix this directly?
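Claude finished @fukuball's task

Fix test failures ✅
The test failure has been fixed and the fix has been pushed.

Completed fix
A hyphen was added to the regex patterns in the following three places.

Technical details
This fix ensures that the hyphen is recognized as a valid word character on every segmentation path.

The CI tests should pass now. You can check the new test results on the Actions page.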
Add hyphen \- support to the regex patterns in Posseg.php (two places) and Finalseg.php, so that words containing hyphens (e.g. BZ-YQ1722) are not split during POS tagging and HMM segmentation.
- Posseg.php:565 - add \- to re_skip_pattern
- Posseg.php:801 - add \- to re_skip_pattern (keeping the &= characters)
- Finalseg.php:295 - add \- to re_skip_pattern
Fixes the failing test HyphenWordsTest::testHyphenatedWordWithPosTag.
Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>
Pull Request Review: Fix Hyphenated Words Segmentation
Summary
This PR successfully addresses issue #99 by adding hyphen support to regex patterns used for text segmentation. The fix is targeted, well-tested, and correctly identifies the root cause.
✅ Strengths
1. Correct Root Cause Analysis
The PR correctly identified that the regex patterns were splitting hyphenated words during the text chunking phase, before dictionary lookup could occur. This is the right level to fix the issue.
2. Comprehensive Test Coverage
3. Consistent Pattern Updates
The fix correctly updates all relevant regex patterns across three classes.
4. Follows Repository Conventions
@claude It looks like the CI tests are still failing. Please check again and fix them.
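Claude finished @fukuball's task

Fix CI test failures ✅

Change in logic: before applying the default tag, first check whether a custom tag exists.

The fix has been pushed. The CI tests should pass now. You can check the new test results on the Actions page.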
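In Posseg::__cutDetail() and Posseg::cut(), when handling a word that matches re_skip_pattern, the code now first checks whether a custom tag exists and only then applies the default pattern tag (number or English). This ensures that custom tags added with Jieba::addWord() are not overwritten by the default tags.
Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>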
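The precedence rule in this commit message can be sketched as a small standalone function. This is an illustration only, not the code in Posseg.php: the `tagForToken()` name, the `$customTags` map, the example tag `nz`, and the default tags `m`, `eng`, and `x` are all assumptions chosen for the example.

```php
<?php
/**
 * Pick a POS tag for a token that matched a skip-style pattern.
 *
 * $customTags is a hypothetical word => tag map standing in for whatever
 * the library stores for user-added words; it is not a jieba-php structure.
 */
function tagForToken(string $word, array $customTags): string
{
    // A user-supplied tag takes precedence over any pattern-based default.
    if (isset($customTags[$word])) {
        return $customTags[$word];
    }
    // Default pattern tags (assumed values): numeric tokens first...
    if (preg_match('/^[0-9.]+$/', $word) === 1) {
        return 'm';
    }
    // ...then ASCII tokens, with the hyphen allowed inside the word.
    if (preg_match('/^[a-zA-Z0-9\-]+$/', $word) === 1) {
        return 'eng';
    }
    // Anything else gets a generic fallback tag.
    return 'x';
}

$customTags = ['BZ-YQ1722' => 'nz'];

echo tagForToken('BZ-YQ1722', $customTags), "\n"; // custom tag wins
echo tagForToken('2024', $customTags), "\n";      // numeric default
echo tagForToken('hello', $customTags), "\n";     // English default
```

Checking the custom map first is the whole point of the change: if the pattern checks ran first, a hyphenated product code would always receive the English default and the user's tag would be lost.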
Pull Request Review: Fix Hyphenated Words in Custom Dictionary
Summary
This PR successfully addresses issue #99 by adding hyphen support to regex patterns across the codebase.
Code Quality & Best Practices ✅
Strengths:
Potential Issues & Recommendations