expected f-statistics #227

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

cecileane merged 20 commits into master from f2

May 14, 2025

Member

cecileane commented Apr 24, 2025 •

edited

Loading

new and exported:
- expectedf2matrix, expectedf4table
- tablequartetdata: code from tablequartetCF, which now calls tablequartetdata(..., prefix="CF")
new but not exported:
- descendenceweight (basis to calculate Ω then f2), expectedf3matrix, check_valid_gammas
- hammingdistancematrix and distancecorrection_JC!, whose code was extracted from startingBL!
new manual page to explain these f-statistics and previously available functions pairwisetaxondistancematrix and countquartetsintrees.

Problem to be discussed: For quartet concordance factors, there is more symmetry than for f4-statistics: CF(12|34) = CF(12|43), yet f4(12|34) = - f4(12|43). For this reason, the 3 f4s are listed for the following taxon orders: 12_34, 13_42, 14_23, and then these 3 values sum up to 0.
Yet, the default column names used by tablequartetCF are: 12_34, 13_24, 14_23 (note _24 instead of _42 in the second).

For backward-compatibility, we should keep these original column names for CFs.
But for f4 statistics, it would be best to make them match the actual order of taxa in the f4 statistics by default, without relying on the user to provide their own names (as shown in the documentation).
I'm not sure what's the best way to do this.

cecileane added 17 commits

April 16, 2025 17:44


          descendenceweight

cc61e54


          replaced links to wiki by links to PhyloUtilities

5b55407


          test descendenceweight

cb1f49f


          f2: start

1b0a19a


          weights for f2: start

852f04c


          test of f2

5fbc793


          start: doc page for pairwise and quartet data

5b1e858


          doc on f2 and dist

28aef4f


          doc manual on f2

543bd6e


          modularized startingBL!

d366100


          f4: start, wish list

9ef23e0


          f4 coded, not tested yet

1a57e43


          new: tablequartetdata. not tested yet

7ed4695


          small fix

3e2cba1


          export: expectedf2matrix, expectedf4table, tablequartetdata

cac7aa9


          can't get interlinks; f3 added but not tested yet

690eae8


          tests for f3 and f4

5e928a6

cecileane requested a review from pbastide

April 24, 2025 15:15

codecov bot commented Apr 24, 2025 •

edited

Loading

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines	Coverage Δ
src/auxiliary.jl	`83.31% <100.00%> (+0.47%)`	⬆️
src/expectedfstat.jl	`100.00% <100.00%> (ø)`
src/pairwiseDistanceLS.jl	`95.40% <100.00%> (+0.22%)`	⬆️
src/quartets.jl	`98.31% <100.00%> (+0.02%)`	⬆️
src/recursion_matrices.jl	`89.52% <100.00%> (+0.49%)`	⬆️
src/types.jl	`94.91% <ø> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


          Review docs

60f212d

Member

pbastide commented May 2, 2025

Thank you for implementing this !

It all looks good to me. I made some suggestions on the documentation.

About the quartet names, I agree that relying on the user to provide the right one is not sustainable.
Maybe one of these two solutions might work:

create a tablequartetf4, that would just call tablequartetdata with the right names
add an optional colnames field to the QuartetT{T} type. The default value from the constructor could be ["12_34","13_24","14_23"], but it could be specified when creating a new object. Then quartetdata_columnnames could use this field to get the correct names.

I think I like the second solution better. When we create a QuartetT{T}, we should know how we fill it exactly, so what should be the names. It is also more expandable: if we need to create an other type of Quartet object, we can choose the names directly. Note that we could also have an optional prefix field to the object, default to "CF", so that we would just need to call tablequartetdata on the object, and the correct names and prefix are used automatically.

The downside is that it requires a new field (I don't know if that is expensive ?), while the first solution might be easier to implement.

Member Author

cecileane commented May 2, 2025

The second solution would be most robust indeed! The main issue, I think, is memory. The number of 4-taxon sets explodes quickly, for example with 1365 QuartetT{T} objects on only 15 taxa. If the vector or tuple ["12_34","13_24","14_23"] is stored independently 1365 times, that seems like a waste of memory.

Perhaps we could have QuartetTS{T,S} type with 2 parameters: T for the basic data type, and S might be a new type, for which we define a function columnnames(::S). For now, we would need one for concordance factors, and one for f4 statistics.

But this approach still requires some kind of memory to know that the object has type S, so I'm not sure it's worth the complication, compared to a simple function tablequartetf4 in your first solution.

Member

pbastide commented May 2, 2025

Ah yes, good point about the waste of memory for vectors !

An other solution could be to create a new QuartetTVector structure, that would just be a vector of QuartetT, plus the colnames field (or an other parameter S). That way this information is only stored once for the whole vector. But that might be a breaking change (changing all the Vector{QuartetT{T}} to QuartetTVector objects), I'm not sure it would be worth the trouble.

Maybe tablequartetf4 is a good compromise then, unless we can think of something better.

cecileane added 2 commits

May 2, 2025 15:22


          small fixes in doc

926c735


          tablequartetf4: new, exported. tablequartedata: generic, un-exported

4e38fa0

Member Author

cecileane commented May 2, 2025

The last commit uses strategy 1: with a new function tablequartetf4. I propose not exporting the generic tablequartetdata. quartetdata_columnnames was made more flexible.

pbastide approved these changes

View reviewed changes

Member

pbastide commented May 14, 2025

I agree with exporting tablequartetf4 but not exporting the generic tablequartetdata.

I like this solution, it's simple and adds no memory burden. We'll just need to remember to possibly create a new fonction if we add a new time of quartet data structure in addition to CF and f4.

Thanks !

cecileane merged commit 0d687b6 into master

6 checks passed

cecileane deleted the f2 branch

May 14, 2025 15:25

cecileane added a commit that referenced this pull request


          Merge PR #227 into mu-representation

7d38c6d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet