Skip to content

Conversation

@cecileane
Copy link
Member

@cecileane cecileane commented Apr 24, 2025

  • new and exported:
    • expectedf2matrix, expectedf4table
    • tablequartetdata: code from tablequartetCF, which now calls tablequartetdata(..., prefix="CF")
  • new but not exported:
    • descendenceweight (basis to calculate Ω then f2), expectedf3matrix, check_valid_gammas
    • hammingdistancematrix and distancecorrection_JC!, whose code was extracted from startingBL!
  • new manual page to explain these f-statistics and previously available functions pairwisetaxondistancematrix and countquartetsintrees.

Problem to be discussed: For quartet concordance factors, there is more symmetry than for f4-statistics: CF(12|34) = CF(12|43), yet f4(12|34) = - f4(12|43). For this reason, the 3 f4s are listed for the following taxon orders: 12_34, 13_42, 14_23, and then these 3 values sum up to 0.
Yet, the default column names used by tablequartetCF are: 12_34, 13_24, 14_23 (note _24 instead of _42 in the second).

  • For backward-compatibility, we should keep these original column names for CFs.
  • But for f4 statistics, it would be best to make them match the actual order of taxa in the f4 statistics by default, without relying on the user to provide their own names (as shown in the documentation).
    I'm not sure what's the best way to do this.

@cecileane cecileane requested a review from pbastide April 24, 2025 15:15
@codecov
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
src/auxiliary.jl 83.31% <100.00%> (+0.47%) ⬆️
src/expectedfstat.jl 100.00% <100.00%> (ø)
src/pairwiseDistanceLS.jl 95.40% <100.00%> (+0.22%) ⬆️
src/quartets.jl 98.31% <100.00%> (+0.02%) ⬆️
src/recursion_matrices.jl 89.52% <100.00%> (+0.49%) ⬆️
src/types.jl 94.91% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pbastide
Copy link
Member

pbastide commented May 2, 2025

Thank you for implementing this !

It all looks good to me. I made some suggestions on the documentation.

About the quartet names, I agree that relying on the user to provide the right one is not sustainable.
Maybe one of these two solutions might work:

  • create a tablequartetf4, that would just call tablequartetdata with the right names
  • add an optional colnames field to the QuartetT{T} type. The default value from the constructor could be ["12_34","13_24","14_23"], but it could be specified when creating a new object. Then quartetdata_columnnames could use this field to get the correct names.

I think I like the second solution better. When we create a QuartetT{T}, we should know how we fill it exactly, so what should be the names. It is also more expandable: if we need to create an other type of Quartet object, we can choose the names directly. Note that we could also have an optional prefix field to the object, default to "CF", so that we would just need to call tablequartetdata on the object, and the correct names and prefix are used automatically.

The downside is that it requires a new field (I don't know if that is expensive ?), while the first solution might be easier to implement.

@cecileane
Copy link
Member Author

The second solution would be most robust indeed! The main issue, I think, is memory. The number of 4-taxon sets explodes quickly, for example with 1365 QuartetT{T} objects on only 15 taxa. If the vector or tuple ["12_34","13_24","14_23"] is stored independently 1365 times, that seems like a waste of memory.

Perhaps we could have QuartetTS{T,S} type with 2 parameters: T for the basic data type, and S might be a new type, for which we define a function columnnames(::S). For now, we would need one for concordance factors, and one for f4 statistics.

But this approach still requires some kind of memory to know that the object has type S, so I'm not sure it's worth the complication, compared to a simple function tablequartetf4 in your first solution.

@pbastide
Copy link
Member

pbastide commented May 2, 2025

Ah yes, good point about the waste of memory for vectors !

An other solution could be to create a new QuartetTVector structure, that would just be a vector of QuartetT, plus the colnames field (or an other parameter S). That way this information is only stored once for the whole vector. But that might be a breaking change (changing all the Vector{QuartetT{T}} to QuartetTVector objects), I'm not sure it would be worth the trouble.

Maybe tablequartetf4 is a good compromise then, unless we can think of something better.

@cecileane
Copy link
Member Author

The last commit uses strategy 1: with a new function tablequartetf4. I propose not exporting the generic tablequartetdata. quartetdata_columnnames was made more flexible.

@pbastide
Copy link
Member

I agree with exporting tablequartetf4 but not exporting the generic tablequartetdata.

I like this solution, it's simple and adds no memory burden. We'll just need to remember to possibly create a new fonction if we add a new time of quartet data structure in addition to CF and f4.

Thanks !

@cecileane cecileane merged commit 0d687b6 into master May 14, 2025
6 checks passed
@cecileane cecileane deleted the f2 branch May 14, 2025 15:25
cecileane added a commit that referenced this pull request May 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants