Finding NCS in chains from a PDB file with simple_ncs_from_pdb

Contents

Author(s)
Purpose
Usage for searching
Usage for validation
List of all available keywords

Author(s)

simple_ncs_from_pdb : Tom Terwilliger, Youval Dar, Oleg Sobolev
Phil command interpreter: Ralf W. Grosse-Kunstleve
find_domain: Peter Zwart

Purpose

The simple_ncs_from_pdb method identifies NCS in the chains in a PDB file and writes out the NCS operators in forms suitable for phenix.refine, resolve, and the AutoSol and AutoBuild Wizards.
Validate NCS annotations against a model

Usage for searching

How simple_ncs_from_pdb works:

The basic steps that the simple_ncs_from_pdb carries out are:

Remove part of the structure specified in in exclude_selection parameter.
Identify sets of matching segments in chains in the PDB file by sequences (use value of chain_similarity_threshold as cutoff). These are potential NCS-related chains.
Determine which matching segments have overall rms distance (RMSD) within the given tolerance (chain_max_rmsd, typically 2 A)
Remove residues from matching segments if locally spatially misaligned (residue_match_radius)
Remove atoms from matching residues if not in both residues (for example, if a side chain is missing in one of the matching residues)
Check atoms order residues

Additional notes on how simple_ncs_from_pdb works:

The chains matching is done using dynamic programming alignment of residues and atoms. The first pass contains restriction on minimal similarity, set by min_percent (number of matching residues)/(number of residues in longer chain)

From the matching chains list, we remove chain pairs where the matching segments exceed the RMSD limit.

Matching segments are scanned for local residue misalignment. Residues where (max_atom_distance - min_atom_distance) > match_radius are excluded from matching segment. This allow local differences in matching chains.

If matching residues have different number of atoms (For example if one containing the side chain while the other not), only the matching atoms will be included.

Grouping of chains is not performed.

Alternative conformation are excluded from matching segments.

The matching is done by the residues name strings, not by the residue numbers, this allows handling of insertions in PDB file.

The result of the NCS search is combination of NCS related groups and invariant or non-NCS related regions, to the atom level. In every NCS group all copies have the same number of atoms and can be reproduced by applying the applying the appropriate rotation and translation to the master copy.

Examples

When running

phenix.simple_ncs_from_pdb 2h50.pdb

The following files will be produced

NCS operators written in format for phenix.refine

2h50_simple_ncs_from_pdb.phil

NCS operators written in format for the PHENIX Wizards (if write_spec_files=True)

2h50_simple_ncs_from_pdb.ncs_spec
2h50_simple_ncs_from_pdb.resolve

The file that should be used for refinement is 4boz_simple_ncs_from_pdb.phil. This file can also be modified if a particular NCS relations need to be changed. The content of that file is the exact selection sting of the atoms in the NCS groups

The content of 2h50_simple_ncs_from_pdb.ncs is

ncs_group {
  reference = chain 'A'
  selection = chain 'C'
  selection = chain 'E'
  selection = chain 'G'
  selection = chain 'I'
  selection = chain 'K'
  selection = chain 'M'
  selection = chain 'O'
  selection = chain 'Q'
  selection = chain 'S'
  selection = chain 'U'
  selection = chain 'W'
}

Other outputs

To get output in other format:

phenix.simple_ncs_from_pdb 4boz.pdb write_spec_files=True

Simple_ncs_from_pdb will analyze the chains in 4boz.pdb and identify any NCS that exists. For this sample run the following output is produced:

GROUP 1
Summary of NCS group with 2 operators:
ID of chain/residue where these apply: [['A', 'D'], [[[147, 150], [152, 211],
[213, 275], [280, 305], [307, 308]], [[147, 150], [152, 211], [213, 275], [280, 305], [307, 308]]]]
RMSD (A) from chain A:  0.0  0.72
Number of residues matching chain A:[155, 155]

OPERATOR 1
CENTER:   24.4880  -13.3177  -20.1848

ROTA 1:    1.0000    0.0000    0.0000
ROTA 2:    0.0000    1.0000    0.0000
ROTA 3:    0.0000    0.0000    1.0000
TRANS:     0.0000    0.0000    0.0000

OPERATOR 2
CENTER:   15.9430   11.8822    0.6609

ROTA 1:    0.7964   -0.5651    0.2152
ROTA 2:   -0.5503   -0.8249   -0.1295
ROTA 3:    0.2507   -0.0152   -0.9680
TRANS:    18.3631    5.3425  -23.3604


GROUP 2
Summary of NCS group with 3 operators:
ID of chain/residue where these apply: [['B', 'C', 'E'], [[[1, 41], [43, 71]],
[[1, 41], [43, 71]], [[1, 41], [43, 71]]]]
RMSD (A) from chain B:  0.0  0.8  0.77
Number of residues matching chain B:[70, 70, 70]

OPERATOR 1
CENTER:   47.5581   -9.5652  -26.8403

ROTA 1:    1.0000    0.0000    0.0000
ROTA 2:    0.0000    1.0000    0.0000
ROTA 3:    0.0000    0.0000    1.0000
TRANS:     0.0000    0.0000    0.0000

OPERATOR 2
CENTER:   27.7410  -11.0237  -49.7361

ROTA 1:   -0.6661    0.2615   -0.6986
ROTA 2:    0.2866   -0.7749   -0.5634
ROTA 3:   -0.6886   -0.5755    0.4412
TRANS:    34.1743  -54.0788    7.8610

OPERATOR 3
CENTER:   29.3812   -4.6379   12.2671

ROTA 1:    0.7539   -0.5901    0.2888
ROTA 2:   -0.5860   -0.8027   -0.1106
ROTA 3:    0.2971   -0.0858   -0.9510
TRANS:    19.1286    5.2864  -24.3021

Another way to view the results is

phenix.simple_ncs_from_pdb 4boz.pdb show_summary=true

Chains in model:
---------------------------------------------------
A    B    C    D    E
. . . . . . . . . . . . . . . . . . . . . . . . . .

NCS summary:
---------------------------------------------------
Number of NCS groups     :   2
Group #                  :   1
Number of copies         :   2
Chains in master         :   'A'
Chains in copies         :   'D'
Group #                  :   2
Number of copies         :   3
Chains in master         :   'B'
Chains in copies         :   'C', 'E'
. . . . . . . . . . . . . . . . . . . . . . . . . .

Transforms:
---------------------------------------------------
Group #                  :   1
Transform #              :   1
RMSD                     :   0
ROTA   0    1.0000    0.0000    0.0000
ROTA   1    0.0000    1.0000    0.0000
ROTA   2    0.0000    0.0000    1.0000
TRANS       0.0000    0.0000    0.0000
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Transform #              :   2
RMSD                     :   0.720792956817
ROTA   0    0.7964   -0.5503    0.2507
ROTA   1   -0.5651   -0.8249   -0.0152
ROTA   2    0.2152   -0.1295   -0.9680
TRANS      -5.8295   14.4282  -25.8708
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Group #                  :   2
Transform #              :   1
RMSD                     :   0
ROTA   0    1.0000    0.0000    0.0000
ROTA   1    0.0000    1.0000    0.0000
ROTA   2    0.0000    0.0000    1.0000
TRANS       0.0000    0.0000    0.0000
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Transform #              :   2
RMSD                     :   0.796197645664
ROTA   0   -0.6661    0.2866   -0.6886
ROTA   1    0.2615   -0.7749   -0.5755
ROTA   2   -0.6986   -0.5634    0.4412
TRANS      43.6766  -46.3175  -10.0616
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Transform #              :   3
RMSD                     :   0.772963874518
ROTA   0    0.7539   -0.5860    0.2971
ROTA   1   -0.5901   -0.8027   -0.0858
ROTA   2    0.2888   -0.1106   -0.9510
TRANS      -4.1022   13.4462  -28.0502

There are 5 chains in the PDB file (A,B,C,D,E). In the first group the master is A the the copy D and in the second group the master is B and the copy is C and E. Chain C is not in any group.

RMSD (A) from chain A:  0.0  0.72

shows the RMSD of matching atoms between the master and every other copy, The list of numbers is the list of matching residues by residue number. Note that this is not the exact selection, as it appears in: 4boz_simple_ncs_from_pdb.ncs.

ROTA and TRANS are the rotation and translation information. In the .ncs_spec file and the default on-screen representation of the results, the rotation and translation are:

Master = ROT x Copy + TRANS

While in the CCBTX implementation the common use is:

Copy = ROT x Master + TRANS

So the rotation/Translation are the inverse of each other in the two formats. Both formats are displayed in the summary.

A portion of the contents of the 4boz_simple_ncs_from_pdb.ncs_spec file, which you can edit if you want and which you can use in the AutoBuild Wizard, are shown below. NOTE: The ncs operators describe how to map the N'th ncs-related copy on to the first copy.

Summary of NCS information
Thu Apr  2 15:44:03 2015
/net/cci-filer2/raid1/home/...

new_ncs_group
new_operator

rota_matrix    1.0000    0.0000    0.0000
rota_matrix    0.0000    1.0000    0.0000
rota_matrix    0.0000    0.0000    1.0000
tran_orth     0.0000    0.0000    0.0000

center_orth   24.4880  -13.3177  -20.1848
CHAIN A
RMSD 0
MATCHING 155
  RESSEQ 147:150
  RESSEQ 152:211
  RESSEQ 213:275
  RESSEQ 280:305
  RESSEQ 307:308

new_operator

rota_matrix    0.7955   -0.5660    0.2164
rota_matrix   -0.5511   -0.8242   -0.1300
rota_matrix    0.2520   -0.0159   -0.9676
tran_orth    18.4021    5.3569  -23.3621

center_orth   15.9430   11.8822    0.6609
CHAIN D
RMSD 0.817
MATCHING 155
  RESSEQ 147:150
  RESSEQ 152:211
  RESSEQ 213:275
  RESSEQ 280:305
  RESSEQ 307:308

Possible Problems

Simple_ncs_from_pdb cannot find NCS groups: try relaxing search parameters,

e.g. set bigger chain_max_rmsd, smaller chain_similarity_threshold.

Simple_ncs_from_pdb produces very complicated selections, like

reference = (chain 'A' and (resid 147:150 or resid 152:194 or (resid 195 and
(name N or name CA or name C or name O or name CB )) or resid 196:211 or
resid 213:229 or (resid 230 and (name N or name CA or name C or name O or name
CB or name CG )) or resid 231:298 or (resid 299 and (name N or name CA or name
C or name O or name CB or name CG )) or resid 300:305 or resid 307:308))

This indicates that procedure is trying to exclude some residues or atoms from the selections. This could be caused by alternative confomations of residues (not supported), poor alignment of particular residues (try to increase residue_match_radius) or missing atoms in one of the chains.

Specific limitations and problems:

a large chain_similarity_threshold can prevent matching segments in chains that are of very different size.
The value of chain_max_rmsd, the RMSD limit between chains should be set with care. Small value might exclude chains with problems in the model, even though they are NCS related. Large value might cause non-NCS related chains and to be constraint as related.
The .ncs_spec files do not support representation of groups with multiple chains in master and copies. a group [(A,C),(B,D)] will be represented as [(A),(B)] and [(C),(D)], duplicating the rotation and translation operations.
The .ncs_spec files do not support representation atoms, residue name and insertion selection, it only shows the numbers of the matching residues.
The .ncs_spec files rotation and translation operation are

Master = ROT x Copy + TRANS

and not:

Copy = ROT x Master + TRANS

They are the inverse of the rotation and translation that are used in the implementation of the NCS relation.

Usage for validation

To run simple_ncs_from_pdb in validation mode one should supply NCS groups in form of phil file. The purpose of this mode is to illustrate how your NCS annotations will be filtered in phenix.refine and phenix.real_space_refine .

Why we filter annotations

We filter annotations because we need to be sure that:

number of atoms in reference and selections are the same
residues in reference and selections match each other
matched residues have the same atoms in all copies

Without all these NCS restraints and constraints cannot work reliably. We understand that it could be tricky to manually create such selections, especially for poor models where some residues may be not modelled yet or lacking some atoms. Therefore we filter user-supplied annotations before refinement by default and there is no way to turn it off.

How we filter annotations

For every NCS group supplied by a user we:

Select part of the model containing reference atoms and all selection atoms.
Then we search for NCS in this sub-model using parameters from ncs_search scope with some modifications. Precisely:
- chain_max_rmsd or 10, whichever is greater
- residue_match_radius or 1000, whichever is greater
- chain_similarity_threshold or 0.5 whichever is smaller.
These modifications are used to relax criteria for validation to a large extent to accomodate poor matches in poor models.
If found NCS relations cover all atoms selected in 1, validation passed, the NCS group is used as was defined by a user If found NCS relations do not cover all the initial atoms selected in 1, we create new NCS selections and use them further in refinement. If no NCS relations were found, the whole NCS group is discarded.

Note, we only filter supplied selections. Nothing extra will be added.

What you may see in the log file (example)

One may find something like this in the log-file:

Validating user-supplied NCS groups...
  Validating:
ncs_group {
  reference = "(chain A and resid 11:376 and not (resid 19:31 or resid 42:67 or resid 70:123 or resid 132:145 or resid 148:184 or resid 190:249 or resid 339:357 or resid 377:396))"
  selection = "(chain C and resid 11:376 and not (resid 19:31 or resid 42:67 or resid 70:123 or resid 132:145 or resid 148:184 or resid 190:249 or resid 339:357 or resid 377:396))"
}
  MODIFIED. Some of the atoms were excluded from your selection.
  The most common reasons are:
    1. Missing residues in one or several copies in NCS group.
    2. Presence of alternative conformations (they are excluded).
    3. Residue mismatch in requested copies.
  Please check the validated selection further down.

and further down something like:

Found NCS groups:
ncs_group {
  reference = (chain 'A' and (resid 11 through 18 or resid 32 through 41 or resid 68 through 6 \
9 or resid 124 through 125 or resid 129 through 131 or resid 146 through 147 or  \
resid 185 through 189 or resid 250 through 338 or resid 358 through 364 or resid \
 375 through 376))
  selection = (chain 'C' and (resid 11 through 18 or resid 32 through 41 or resid 68 through 6 \
9 or resid 124 through 131 or resid 146 through 147 or resid 185 through 189 or  \
resid 250 through 338 or resid 358 through 376))
}

This is how NCS selections were modified for one of the reasons listed above. In this example, they are essentially the same with slightly different syntax. Let's look at the differences.

In chain A residues 126-128 are excluded from selection. Indeed, there were no such residues in chain C.
In chain A residues 365-374 are excluded. Same as #1, absent in chain C.

These two are the only differences from your selection and they don't change anything! They look much different because we have to construct new selections from scratch (internal representation of selections) automatically.

When one sees something like this in new selections:

(resid 28 and (name N or name CA or name C or name O or name CB or name CG or name CD ))

this is usually because another chain is lacking particular atom(s). Therefore we select particular atoms of the residue that present in all other copies. In this particular case, residue 28 in chain A has NE and CZ atoms, while chain B lacks them.

Other messages one can see is:

REJECTED because copies don't match good enough.
Try to revise selections or adjust chain_similarity_threshold or
chain_max_rmsd parameters.

OK. All atoms were included in
validated selection.

List of all available keywords

pdb_in = None Input PDB file to be used to identify ncs
temp_dir = "" temporary directory (ncs_domain_pdb will be written there)
ncs_domain_pdb_stem = None NCS domains will be written to ncs_domain_pdb_stem+"group_"+nn
write_ncs_domain_pdb = False You can write out PDB files representing NCS domains for density modification if you want
show_ncs_phil = True Show phil parameters for the NCS groups
show_summary = False Show NCS information summary
write_spec_files = False Write .ncs_spec and .resolve representation of NCS groups
ncs_searchSet of parameters for NCS search procedure. Some of them also used for filtering user-supplied ncs_group.
- enabled = False Enable NCS restraints or constraints in refinement (in some cases may be switched on inside refinement program).
- exclude_selection = "element H or element D or water" Atoms selected by this selection will be excluded from the model before any NCS search and/or filtering procedures. There is no way atoms defined by this selection will be in NCS.
- chain_similarity_threshold = 0.85 Threshold for sequence similarity between matching chains. A smaller value may cause more chains to be grouped together and can lower the number of common residues
- chain_max_rmsd = 2. limit of rms difference between chains to be considered as copies
- residue_match_radius = 4.0 Maximum allowed distance difference between pairs of matching atoms of two residues
- try_shortcuts = False Try very quick check to speed up the search when chains are identical. If failed, regular search will be performed automatically.
- minimum_number_of_atoms_in_copy = 3 Do not create ncs groups where master and copies would contain less than specified amount of atoms
ncs_groupThe definition of one NCS group. Note, that almost always in refinement programs they will be checked and filtered if needed.
- reference = None Residue selection string for the complete master NCS copy
- selection = None Residue selection string for each NCS copy location in ASU