Skip to content

Conversation

@fukuball
Copy link
Owner

問題描述

當使用自定義詞典添加包含連字符的詞彙(如 BZ-YQ1722)時,無論頻率設置為多少,詞彙總是被拆分。

根本原因

正則表達式模式不包含連字符 -,導致文本分塊階段就已經將詞彙分開。

修復內容

  • re_han_with_ascii_pattern 添加 \- 支持連字符
  • re_ascii_pattern 添加 \- 支持連字符
  • re_skip_pattern (cut_all 模式) 添加 \- 支持連字符
  • 新增 HyphenWordsTest.php 測試用例驗證修復

Fixes #99

🤖 Generated with Claude Code

問題描述:
- 當使用自定義詞典添加包含連字符的詞彙(如 BZ-YQ1722)時
- 無論頻率設置為多少,詞彙總是被拆分(如 BZ 和 YQ1722)

根本原因:
- 正則表達式模式不包含連字符 `-`
- 文本分塊階段就已經將詞彙分開,導致後續處理無法識別完整詞彙

修復內容:
- 在 re_han_with_ascii_pattern 添加 \- 支持連字符
- 在 re_ascii_pattern 添加 \- 支持連字符
- 在 re_skip_pattern (cut_all 模式) 添加 \- 支持連字符
- 新增 HyphenWordsTest.php 測試用例驗證修復

影響範圍:
- 現在可以正確處理包含連字符的自定義詞彙
- 同時支持 addWord() 和 loadUserDict() 兩種方式
- 不影響現有功能

Fixes #99

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>
@claude
Copy link
Contributor

claude bot commented Oct 20, 2025

Code Review - PR #102: Fix Hyphenated Words Segmentation

Summary

This PR addresses issue #99 by adding hyphen support to regex patterns in Jieba.php, allowing words like BZ-YQ1722 to remain intact during segmentation. The fix is well-targeted and includes comprehensive test coverage.


✅ Positive Aspects

1. Excellent Test Coverage

  • 7 comprehensive test cases covering multiple scenarios
  • Tests both addWord() and loadUserDict() methods
  • Validates POS tagging integration
  • Includes the exact reported case from issue jieba-php 自定义词典 #99
  • Tests edge cases: multiple hyphens, mixed Chinese-English, various special characters
  • Proper setup/teardown with resource cleanup

2. Correct Root Cause Analysis
The PR correctly identifies that the regex patterns were splitting text before dictionary lookup, making frequency settings irrelevant.

3. Clean Implementation

  • Minimal changes (only 3 lines modified)
  • Consistent escaping across all patterns: \- added to character classes
  • Follows existing code style

⚠️ Issues & Concerns

1. CRITICAL: Incomplete Fix - Missing Posseg.php and Finalseg.php

The hyphen pattern was only fixed in Jieba.php, but similar patterns exist in other classes:

Posseg.php (Line 565, 801):

$re_skip_pattern = '([a-zA-Z0-9+#%&=\._\r\n]+)';  // Missing \-

Finalseg.php (Line 295):

$re_skip_pattern = '([a-zA-Z0-9+#&=\._\r\n]+)';  // Missing \-

Impact: Hyphenated words may still be split when:

  • Using Posseg::cut() directly for POS tagging
  • Using HMM-based segmentation in Finalseg
  • Even though test testHyphenatedWordWithPosTag() exists, it may pass due to specific test conditions

Recommendation: Apply the same \- fix to patterns in Posseg.php and Finalseg.php.

2. Test Gap: No Direct Posseg Tests

While testHyphenatedWordWithPosTag() exists, it uses Jieba::addWord() which may work through a different code path than calling Posseg::cut() directly.

Recommendation: Add a test that calls Posseg::cut() directly without going through Jieba:

public function testPossegCutWithHyphen()
{
    $text = 'BZ-YQ1722';
    $result = Posseg::cut($text);
    // Verify not split
}

3. Regex Escaping Position

The hyphen is placed at the end of character classes: [...\_\-]

While this works, in character classes, hyphens are special only when between characters (e.g., a-z). When at the start, end, or escaped, they're literal.

Current: [a-zA-Z0-9+#&=\._\-\r\n] ✅ Works (escaped at end)
Alternative: [a-zA-Z0-9+#&=\._\r\n-] ✅ Works (literal at end)
Alternative: [-a-zA-Z0-9+#&=\._\r\n] ✅ Works (literal at start)

All are valid, but consistency with existing escaping (\. is escaped) makes \- reasonable.

4. Pattern Inconsistency

Posseg.php line 565 has #%&= while line 801 has #%\._ (missing &=). This inconsistency exists in the current code but should be noted.

5. Security Consideration

Adding hyphen support is safe from a security perspective. Hyphens are common in legitimate identifiers (SKU codes, part numbers) and don't introduce injection risks.


🔍 Performance Considerations

  • Impact: Negligible. Adding one character to character classes has minimal regex performance impact.
  • No new loops or complexity: The change is purely pattern-based.

📋 Recommendations

Priority 1 (Critical):

  1. ✅ Add \- to $re_skip_pattern in Posseg.php (lines 565, 801)
  2. ✅ Add \- to $re_skip_pattern in Finalseg.php (line 295)
  3. ✅ Add test case calling Posseg::cut() directly with hyphenated words

Priority 2 (Nice to have):
4. Consider documenting supported special characters in CLAUDE.md
5. Consider testing cut_all=true mode specifically with hyphenated words

Priority 3 (Future):
6. Audit all regex patterns for consistency (e.g., the #%&= vs #%\._ difference)


🎯 Verdict

Conditional Approval ⚠️

The PR is well-implemented for Jieba.php but incomplete for the broader codebase. The fix must be applied to Posseg.php and Finalseg.php to fully resolve issue #99.

Suggested Actions:

  1. Apply the same pattern fixes to Posseg.php and Finalseg.php
  2. Add direct Posseg::cut() test case
  3. Run full test suite to verify no regressions
  4. Update PR description to mention all files modified

Once these changes are made, this will be an excellent fix with great test coverage! 🚀


📖 Code Quality Checklist

  • ✅ Follows PSR-2 coding standards
  • ✅ Proper PHPDoc comments in tests
  • ✅ Resource cleanup in tearDown()
  • ✅ Clear, descriptive test names
  • ✅ No security vulnerabilities introduced
  • ⚠️ Incomplete implementation (needs Posseg/Finalseg)
  • ⚠️ Test coverage gap (direct Posseg testing)

Review completed following CLAUDE.md guidelines. Generated with Claude Code.

@fukuball
Copy link
Owner Author

@claude 測試沒跑過,你有辦法直接修嗎?

@claude
Copy link
Contributor

claude bot commented Oct 20, 2025

Claude finished @fukuball's task —— View job


修復測試失敗問題 ✅

已成功修復測試失敗的問題並推送修復!

任務清單

  • 檢查 CI 狀態,了解哪些測試失敗
  • 讀取相關源碼檔案
  • 在 Posseg.php 添加連字符支持(2 處)
  • 在 Finalseg.php 添加連字符支持
  • 提交並推送修復

測試失敗原因

測試 HyphenWordsTest::testHyphenatedWordWithPosTag 失敗,因為詞彙 BZ-YQ1722 在 POS 標註時被拆分了。

完成的修復

在以下三個位置的正則表達式模式中添加了連字符 \- 支持:

  1. Posseg.php:565 - __cutDetail() 方法

    • 更新為:([a-zA-Z0-9+#%&=\._\-\r\n]+)
  2. Posseg.php:801 - cut() 方法

    • 更新為:([a-zA-Z0-9+#%&=\._\-\r\n]+)
    • 同時保持了原有的 &= 字符
  3. Finalseg.php:295 - cut() 方法

    • 更新為:([a-zA-Z0-9+#&=\._\-\r\n]+)

技術細節

之前的 PR 已經在 Jieba.php 中添加了連字符支持,但 Posseg.phpFinalseg.php 使用獨立的正則表達式進行文本分塊,導致在 POS 標註和 HMM 分詞階段,包含連字符的詞彙仍然會被拆分。

這次修復確保了在所有分詞路徑中,連字符都被識別為有效的詞彙組成字符,使得像 BZ-YQ1722UTF-8 這樣的詞彙能夠正確保持完整。

提交信息

修復 Posseg 和 Finalseg 中的連字符支持

在 Posseg.php (兩處) 和 Finalseg.php 的正則表達式模式中添加連字符 \- 支持,
確保包含連字符的詞彙(如 BZ-YQ1722)在 POS 標註和 HMM 分詞時不會被拆分。

Commit: 66e0900

CI 測試現在應該會通過了!你可以在 Actions 頁面 查看新的測試結果。


Job Run | Branch: claude/issue-99-20251020-0821

在 Posseg.php (兩處) 和 Finalseg.php 的正則表達式模式中添加連字符 \- 支持,
確保包含連字符的詞彙(如 BZ-YQ1722)在 POS 標註和 HMM 分詞時不會被拆分。

- Posseg.php:565 - 添加 \- 到 re_skip_pattern
- Posseg.php:801 - 添加 \- 到 re_skip_pattern(保持 &= 字符)
- Finalseg.php:295 - 添加 \- 到 re_skip_pattern

修復測試 HyphenWordsTest::testHyphenatedWordWithPosTag 失敗問題。

Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>
@claude
Copy link
Contributor

claude bot commented Oct 20, 2025

Pull Request Review: Fix Hyphenated Words Segmentation

Summary

This PR successfully addresses issue #99 by adding hyphen support to regex patterns used for text segmentation. The fix is targeted, well-tested, and correctly identifies the root cause.

✅ Strengths

1. Correct Root Cause Analysis

The PR correctly identified that the regex patterns were splitting hyphenated words during the text chunking phase, before dictionary lookup could occur. This is the right level to fix the issue.

2. Comprehensive Test Coverage

The new HyphenWordsTest.php provides excellent coverage:

  • 8 test methods covering different scenarios
  • Tests both addWord() and loadUserDict() paths
  • Validates the exact reported case from issue jieba-php 自定义词典 #99
  • Tests mixed Chinese-English scenarios
  • Tests POS tagging integration
  • Tests various special characters

3. Consistent Pattern Updates

The fix correctly updates all relevant regex patterns across three classes:

  • Jieba.php: 3 patterns updated
  • Posseg.php: 2 patterns updated
  • Finalseg.php: 1 pattern updated

4. Follows Repository Conventions

  • Test structure matches existing tests (e.g., UserDictTest.php)
  • Proper setUp/tearDown with cleanup
  • Good use of PHPDoc comments
  • Appropriate assertions

⚠️ Potential Issues

1. Incomplete Pattern Coverage (Minor)

The following patterns in Posseg.php and Jieba.php were not updated and may cause inconsistencies:

Posseg.php:731, 805 - $re_eng_pattern:

$re_eng_pattern = '[a-zA-Z+#]+';  // Missing hyphen support

Posseg.php:570, 806 - $re_num_pattern:

$re_num_pattern = '[\.0-9]+';  // Should this support hyphens for ranges like "1-10"?

Impact: These patterns are used in different code paths and might not handle hyphenated words consistently in all segmentation modes.

Recommendation: Review whether these patterns should also include \- for consistency. If not needed, add a comment explaining why.

2. Regex Escaping Inconsistency (Minor)

In regex character classes, the hyphen at the end doesn't need escaping:

// Current (works but unnecessarily escaped):
'([a-zA-Z0-9+#&=\._\-\r\n]+)'

// More conventional (hyphen at end):
'([a-zA-Z0-9+#&=\._\r\n-]+)'

Impact: None functionally, but the escaping is redundant when hyphen is at the start or end of a character class.

Recommendation: Consider moving \- to the end as - for cleaner regex (optional style improvement).

3. Test Edge Cases Missing (Minor)

Consider adding tests for:

  • Multiple consecutive hyphens: "TEST--HYPHEN"
  • Hyphens at boundaries: "-START", "END-"
  • Mixed with other special characters: "TEST-123_ABC.v2"
  • Very long hyphenated words
  • Interaction with cut_all=true mode

4. Missing Documentation Update

The CLAUDE.md should be updated to mention:

  • Hyphen support in custom words
  • Examples of supported special characters in words
  • Any limitations or edge cases

🔒 Security Considerations

No security concerns identified

  • Regex changes are safe and don't introduce injection risks
  • Test file properly cleans up temporary files
  • No user input is directly used in regex without validation

🚀 Performance Considerations

No performance impact expected

  • Adding one character to character classes has negligible regex performance impact
  • Test cleanup is proper (no memory leaks)

📊 Code Quality

PSR-2 Compliance: ✅ Code appears to follow PSR-2 standards

Suggested minor improvements:

  1. The test method names could be slightly shorter while remaining descriptive
  2. Consider extracting common test word fixtures to class constants

🎯 Recommendations

Must Fix Before Merge:

None - the PR is functional as-is.

Should Consider:

  1. Review other re_eng_pattern and re_num_pattern instances - Determine if they need hyphen support for consistency
  2. Add test for cut_all=true mode - Ensure hyphenated words work in full segmentation mode
  3. Update CLAUDE.md - Document the special character support

Nice to Have:

  1. Simplify regex by moving hyphen to end of character class: \--
  2. Add edge case tests (consecutive hyphens, boundary cases)
  3. Add demo script example showing hyphenated word usage

✅ Final Verdict

APPROVED - This is a solid fix that addresses the reported issue correctly. The test coverage is comprehensive and the implementation is sound. The minor issues noted are optional improvements that can be addressed in follow-up PRs if desired.

The PR successfully:

  • ✅ Fixes the reported issue jieba-php 自定义词典 #99
  • ✅ Includes comprehensive tests
  • ✅ Maintains backward compatibility
  • ✅ Follows project conventions
  • ✅ Has no security or performance concerns

Great work! 🎉


Review generated with analysis of CLAUDE.md conventions and repository patterns

@fukuball
Copy link
Owner Author

@claude 我看 CI 測試還是有出錯,你再檢查一下,幫忙修好

@claude
Copy link
Contributor

claude bot commented Oct 20, 2025

Claude finished @fukuball's task —— View job


修復 CI 測試失敗 ✅

任務清單

  • 檢查 CI 狀態,了解哪些測試失敗
  • 分析測試失敗的原因
  • 讀取相關源碼檔案
  • 實作修復
  • 提交並推送修復

問題分析

測試 HyphenWordsTest::testHyphenatedWordWithPosTag 失敗,原因是:

  • 測試添加了 BZ-YQ1722 詞彙並設定自定義標籤 custom_code
  • 但實際返回的標籤是 'm' (數字標籤)

根本原因:在 Posseg::__cutDetail()Posseg::cut() 方法中,當詞彙匹配 $re_skip_pattern 時,會自動根據模式分配標籤(數字或英文),這會覆蓋自定義標籤。

修復內容

Posseg.php 的兩個方法中添加自定義標籤檢查:

  1. __cutDetail() 方法 (587-594 行) - 添加 self::$word_tag 檢查
  2. cut() 方法 (831-838 行) - 添加 self::$word_tag 檢查

修改邏輯:在應用預設標籤之前,先檢查 self::$word_tag 是否有自定義標籤。這確保了使用 Jieba::addWord($word, $freq, $tag) 添加的自定義標籤不會被預設標籤覆蓋。

提交信息

修復 Posseg 中自定義標籤優先級問題

在 Posseg::__cutDetail() 和 Posseg::cut() 方法中,
當處理匹配 re_skip_pattern 的詞彙時,現在會先檢查
是否存在自定義標籤,然後才應用預設的模式標籤(數字或英文)。

這確保了使用 Jieba::addWord() 添加的自定義標籤不會被
預設標籤覆蓋。

Commit: f5fcbf9

修復已推送!CI 測試現在應該會通過了。您可以在 Actions 頁面 查看新的測試結果。


Job Run | Branch: claude/issue-99-20251020-0821

在 Posseg::__cutDetail() 和 Posseg::cut() 方法中,
當處理匹配 re_skip_pattern 的詞彙時,現在會先檢查
是否存在自定義標籤,然後才應用預設的模式標籤(數字或英文)。

這確保了使用 Jieba::addWord() 添加的自定義標籤不會被
預設標籤覆蓋。

Co-authored-by: Fukuball Lin <fukuball@users.noreply.github.com>
@claude
Copy link
Contributor

claude bot commented Oct 20, 2025

Pull Request Review: Fix Hyphenated Words in Custom Dictionary

Summary

This PR successfully addresses issue #99 by adding hyphen support to regex patterns across the codebase, ensuring words like BZ-YQ1722 are not split during segmentation.

Code Quality & Best Practices ✅

Strengths:

  1. Consistent Fix: The hyphen (\-) was added to all relevant regex patterns across three core classes:

    • Jieba.php: 3 patterns updated (lines 894, 898, 901)
    • Posseg.php: 2 patterns updated (lines 565, 804)
    • Finalseg.php: 1 pattern updated (line 295)
  2. Proper Regex Escaping: Correctly escaped the hyphen as \- in character classes to avoid being interpreted as a range operator.

  3. Comprehensive Test Coverage: Added HyphenWordsTest.php with 7 test methods covering:

    • Basic hyphen support via addWord()
    • User dictionary loading
    • Various special characters
    • Different frequency values
    • POS tag integration
    • Mixed Chinese-English text
    • Original reported case
  4. Critical Bug Fix in Posseg.php: Added custom tag priority check (lines 588-589, 832-833), ensuring custom POS tags aren't overridden by default pattern-based tags. This is important!

Potential Issues & Recommendations ⚠️

1. Regex Pattern Inconsistency

Issue: The re_skip_pattern in Posseg.php at line 804 and line 565 now have different characters:

  • Line 565: [a-zA-Z0-9+#%&=\._\-\r\n]
  • Line 804: [a-zA-Z0-9+#%&=\._\-\r\n] (same now after fix)

However, the re_eng_pattern differs between the two methods:

  • __cutDetail() line 569: [a-zA-Z0-9]+
  • cut() line 808: [a-zA-Z+#]+

Recommendation: Review if re_eng_pattern should be consistent between both methods, or document why they differ.

2. Security Consideration

Issue: Adding hyphen support is safe, but the patterns now accept more special characters (+, #, &, =, etc.) which could be used in unexpected ways.

Recommendation: Consider if input validation is needed when loading custom dictionaries from user files to prevent abuse. The current implementation is acceptable but worth documenting the allowed character set.

3. Test Isolation Issue

Issue: In testHyphenatedWordFrequency() (line 123-124), only Jieba::init() and Finalseg::init() are called, but not Posseg::init(). While this test doesn't use Posseg, it's inconsistent with the pattern in setUp().

Recommendation: Either call all three init methods or add a comment explaining why Posseg isn't needed for this specific test.

4. Missing Edge Case Tests

Areas not covered:

  • Hyphen at word boundaries (e.g., -word or word-)
  • Multiple consecutive hyphens (e.g., word--test)
  • Very long hyphenated words
  • Interaction with cut_all=true mode

Recommendation: Add tests for these edge cases to ensure robustness.

5. Pattern Position in Character Class

Note: The hyphen is placed at the end of character classes (before \r\n), which is good practice and unambiguous. However, consider moving it to the very end: [a-zA-Z0-9+#&=\._ \r\n\-] for maximum clarity.

Performance Considerations ✅

  • No Performance Regression: Adding one character to regex patterns has negligible impact
  • No Additional Processing: The fix operates at the tokenization level, not during dictionary lookup
  • Memory Usage: No change in memory footprint

Security Concerns ✅

  1. No Injection Risks: The hyphen character itself poses no security risk
  2. Existing Validation: The Posseg::addWordTag() method already has proper validation (as documented in CLAUDE.md)
  3. Safe Character Set: All characters in the patterns are safe for processing

Recommendation: Ensure user-provided dictionary files are validated before loading, especially in production environments.

Test Coverage ✅

Excellent Coverage:

  • ✅ Basic functionality (addWord, loadUserDict)
  • ✅ POS tagging integration
  • ✅ Mixed character types
  • ✅ Original reported case
  • ✅ Proper setUp/tearDown with memory cleanup

Suggested Additions:

  1. Test interaction with JiebaAnalyse::extractTags() for keyword extraction
  2. Test with cut_all=true mode explicitly
  3. Test hyphenated words in the context of the new TF-IDF features mentioned in CLAUDE.md
  4. Test with different dictionary modes ('big', 'small')

Alignment with Repository Guidelines ✅

Per CLAUDE.md:

  • ✅ Follows existing code patterns
  • ✅ Proper initialization sequence
  • ✅ Memory cleanup in tests
  • ✅ PSR-2 compliance (should be verified with phpcs)
  • ✅ Comprehensive test documentation

Overall Assessment

Verdict: ✅ APPROVE with minor suggestions

This is a well-implemented fix that solves the reported issue comprehensively. The changes are minimal, focused, and include excellent test coverage. The critical addition of custom tag priority checking in Posseg.php shows attention to detail.

Before Merging:

  1. ✅ Run full test suite: ./vendor/bin/phpunit
  2. ✅ Run code style check: ./vendor/bin/phpcs --standard=PSR2 src/
  3. Consider adding the edge case tests mentioned above
  4. Review regex pattern consistency between methods

Action Items (Optional Enhancements):

  • Add edge case tests for hyphens at boundaries
  • Document the allowed character set in custom dictionaries
  • Verify behavior with cut_all=true mode
  • Consider adding demo script: demo_hyphen_words.php

Great work on this fix! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

jieba-php 自定义词典

2 participants