Dear Phil,
Comparing R-factors in this case does not tell you whether one refinement is better or worse than the other. It simply tells you nothing, because the R-factor is not a meaningful basis for comparison when you are dealing with two different datasets (datasets containing different numbers of reflections).
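To make the point concrete, here is a toy sketch (the amplitudes are invented for illustration, and this is the plain textbook R-factor with no scaling, not what phenix.refine actually computes). The same model gives two different R values depending only on whether the Fobs=0 reflections are counted, because each such reflection adds |Fcalc| to the numerator and nothing to the denominator:

```python
def r_factor(f_obs, f_calc):
    """Textbook R = sum(||Fobs| - |Fcalc||) / sum(|Fobs|); no scale factor applied."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Hypothetical amplitudes for one and the same model; the last two
# reflections were measured as Fobs = 0.
f_obs = [100.0, 50.0, 20.0, 0.0, 0.0]
f_calc = [95.0, 54.0, 18.0, 3.0, 2.0]

# R over all reflections, including Fobs = 0.
r_all = r_factor(f_obs, f_calc)

# R over the Fobs > 0 subset only.
kept = [(o, c) for o, c in zip(f_obs, f_calc) if o > 0]
r_positive = r_factor([o for o, _ in kept], [c for _, c in kept])

print(f"R (all reflections): {r_all:.4f}")
print(f"R (Fobs > 0 only):   {r_positive:.4f}")
```

The model did not change between the two numbers, only the set of reflections did, so the difference between them says nothing about which refinement is better.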
This would mean that the whole thing is inherently untestable because of phenix.refine's rejection criteria - there will always be a difference in data count because of that.
I guess this paper nicely explains this: Interpretation of ensembles created by multiple iterative rebuilding of macromolecular models. T. C. Terwilliger, R. W. Grosse-Kunstleve, P. V. Afonine, P. D. Adams, N. W. Moriarty, P. H. Zwart, R. J. Read, D. Turk and L.-W. Hung Acta Cryst. D63, 597-610 (2007).
Propose a better experiment.
Just a stream of thought (you can tune this up): Do two complete structure solutions in parallel: 1) using the dataset containing Fobs=0 and 2) using the dataset with Fobs>0 only. Given the above paper, you would probably need to build an ensemble of models in each experiment. Then find the differences between the two results and demonstrate that these differences are "important" (or, put differently, analyze these differences and maybe you will find them "important").
Yes, you will need to define what "important" means: a tiny gain in R-factor (or at least making sure it stays the same), or revealing some new structural details, building a more complete model, resolving unclear densities and so on. You will probably need to do this for a bunch of structures (and not just for your favorite one), at different resolutions. If you are lucky, you may run into a case where the building/refinement process gets stuck if you remove the Fobs=0 reflections, and you magically unstick it by including them. And so on, and so on.
A nice little project for someone who has some extra time to spend. You may even publish it then: "On impact of weak reflections (Fobs=0) in structure solution and final model quality".
But I would not do this myself, because I know it is good to use all data; at the very least it will not do harm. And I know it will be done, and phenix.refine will use these data. The only question is the priority: since I don't know how important it is (so far no-one has convinced me that I have to rush to do it right now), I will not jump into doing it today, but would rather keep doing more pressing things. But sometime in the future it will be there.
All the best!
Pavel.