Description of the program SIMULAID: a collection of utilities designed to help setting up and analyze molecular simulations.

Mihaly Mezei

Department of Pharmacological Sciences,
Icahn Sinai School of Medicine at Mount Sinai,
New York, NY 100102

Mihaly.Mezei@mssm.edu

Nov. 19, 2017.

Reference: M. Mezei, Simulaid: a simulation facilitator and analysis program, J. Comp. Chem., 31, 2658-2668 (2010). DOI:10.1002/jcc.21551

 
 
0.  GENERAL INTRODUCTION TO SIMULATION SETUPS
I.  DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II.  RUNNING THE PROGRAM - INPUT
III.  THE OUTPUT ON THE TERMINAL
IV.  FILE FORMATS
V.  INSTALLATION
VI.  CHANGING DIMENSION
VII.  IRIS GL GRAPHICS VERSION
VIII.  REFERENCES

0. GENERAL INTRODUCTION TO SIMULATION SETUPS

In general, a simulation requires three different types of information:

For analysis, the may also be needed.

Simulaid will help in various aspects of establishing these. In order to help clarify the various functionalities of Simulaid, a brief description is given of the various components of these descriptions. Frequent reference will be made to the PDB (Protein Data Bank) format that is considered the 'lingua franca' of molecular simulations (although it has a number of 'dialects') and to formats used by the simulation packages Amber, Charmm, Gromos/Gromacs, Discover, Macromodel, and MMC.

0.1. Description of the interaction energy terms

Most simulation packages rely on the concept of atomtype for the specification of the interaction potential, supplemented by a partial charge. It is generally assumed that force constants, dispersion and exhange repulsion terms are transferable to a larger extent than the partial charges.

The atomtypes can be defined by alphanumeric labels (Amber, Charmm, Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS) or both (MMC). In most cases, a parameter file contains the data necessary

0. GENERAL INTRODUCTION TO SIMULATION SETUPS

In general, a simulation requires three different types of information:

For analysis, the may also be needed.

Simulaid will help in various aspects of establishing these. In order to help clarify the various functionalities of Simulaid, a brief description is given of the various components of these descriptions. Frequent reference will be made to the PDB (Protein Data Bank) format that is considered the 'lingua franca' of molecular simulations (although it has a number of 'dialects') and to formats used by the simulation packages Amber, Charmm, Gromos/Gromacs, Discover, Macromodel, and MMC.

0.1. Description of the interaction energy terms

Most simulation packages rely on the concept of atomtype for the specification of the interaction potential, supplemented by a partial charge. It is generally assumed that force constants, dispersion and exhange repulsion terms are transferable to a larger extent than the partial charges.

The atomtypes can be defined by alphanumeric labels (Amber, Charmm, Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS) or both (MMC). In most cases, a parameter file contains the data necessary to extract the various coefficients to be used in the energy expressions (Amber, Charmm, Gromos/Gromacs, Discover) but they may be built into the program source code (Macromodel) or both (MMC).

0.2. Description of the molecules in the system

Most simulation packages use a hierarchical approach to define the assembly of atoms to be simulated.

Simulaid can read molecular structures in the following formats:

Depending on the input format used, Simulaid can also write in most of the formats listed above. It can not write a .mae file and it can only write a .mol2 file if the input was also in .mol2 format.

Since residues are considered frequently occuring molecular units, several packages (Amber, Charmm, Discover) use a so-called residue topology file (RTF) for their definition. For amino and nucleic acids the PDB nomenclature is followed, but only approximately. This standard includes definitions for amino acid and nucleic acid residues. These definitions consist of a three-letter residue name and atomnames that can be up to four characters long.

Simulaid can read topology files in the following formats:

To arrive at the fully defined system, most packages rely on an annotated coordinate file containing the Cartesian coordinates of the atoms in the system as well as on the RTF and parameter files. For Amber, the coordinate file is in the PDB format while for Charmm it is in the so-called CRD format. Once the list of residues and atoms are known to Amber or Charmm, they will turn to their RTF file where the combination of residue names and atomnames will provide for each atom their atomtypes and partial charges. The .car file of Discover, the .dat file of Macromodel, the .mol2 file of Tripos, as well as the .slt file of MMC, however will contain not only the residue names and atomnames, but the atom types and partial charges as well.

The record formats of the formatted input records can be listed by Simulaid.

The RTF file of Amber and Charmm also specifies the bonding pattern. Macromodel and Tripos, on the other hand, require the bond list in the coordinate file while MMC will deduce the bonding pattern from the coordinates themselves.

Segments are usually defined by the input stream.

0.3. Description of the boundary conditions.

The simulated system may be a droplet in vacuum or a unit cell of a periodic system. Each program has to be explicitly instructed in the input stream as to the size and shape of the periodic cell - the options vary from package to package. Charmm uses an additional file of the type .img that specifies the centers of all surrounding cells, but most packages use built-in routines for the different cell types allowed.

0.4. Trajectories. Simulaid can read trajectories in the following formats:

If a trajectory is in multiple files then the file names should be derived from the name of the first segment. Assuming that the trajectory starts with file x.dcd, subsequent segments should be called x_2.dcd, x_3.dcd, etc.

NOTE on binary files: it is important to compile Simulad with the same (type) of compilers as the program that generated the trajectory was compiled with. Binary files generated with Fortran-77 and Fortran-90 are different; similarly, they can be different for 32-bit and 64-bit systems.

0.5. Clustering options. Simulaid has implemented several clustering algorithms.

Single-link and maximum neighbor based clustering can be run either by specifying a distance cutoff, resulting in an unspecified number of clusters or by specifying the number of clusters in which case the cutoff is udjusted until this number of cluster results.

Furthermore, single-link clustering can be used to do a hierarchical clustering using several different distance cutoffs. In this case clusters obtained with a larger cutoff are further clustered with the next smaller cutoff.

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

I.0. Online help: Any prompt containing a '?' within the bracketed field (e.g., 'Do you want ... [y/n/?] ?' or 'Number of ... [100,?]=') will respond with an explanation if the user types just a '?'. Some of the prompts asking for a selection from a list also has a help option that can be activated by typing a '?'.

I.1. Operations performed by several functions:

Online tips: Any prompt containing a '+' within the bracketed field (e.g., 'Do you want ... [y/n/+] ?') will provide a tip if the user types just a '+'.

In each case, the prompt is repeated after the explanation has been printed.

For translation/rotation and trajectory conversions Simulaid offers the option to reset the system into the periodic cell. The use of periodic boundary conditions may be restricted to two or one dimensions; i.e., the resetting is applied only in a selected coordinae plane or along a selected axis.

This option also selects periodic images of each solute molecule in such a way that the solute appears as compact as possible. Since Simulaid considers each solute segment a single molecule, this may result in moving groups of molecules (e.g., lipids) always together. Thus, for functions that may involve resetting the system into a PBC cell, Simulaid offers the users the option to define a number of residues as molecular residues, i.e., groups of atoms that are to be reset into the PBC cell together but independent of the rest of the solute. Furthermore, residues can also be defined as ions - these are also treated individually but they are not moved around only to keep them in the simulation cell, but not in an attempt to make the solute more compact,

I.2. Specific operations performed by Simulaid:

II.RUNNING THE PROGRAM - INPUT

The program is run interactively: the user is first asked to select the operation to be performed then (if needed) the

  • Creation of a Grasp .crg file involves specifying the RTF file name(s). Note that the Grasp format limits atom names to four characters and residue names to three characters. If either of these limits is exceded, the program will truncate the names and print a warning. Also, the program requires a valid 'input' filename, but it will be ignored. Each solute segment/chain is colored differently (using six colors, cyclicly repeated: 1:red; 2:green; 3:yellow; 4:blue; 5:pink; 6; cyan), solvents are colored white. For simulations with variable box size the smallest and largest PBC boxes are also drawn. For multisegment solutes one segment can be highlighted. After each pass option is given to repeat the animation and change the value for the length of waiting time between showing two frames. Each time the animation pauses the display can also be manipulated by keyboard commands. The length of wait is defined in a hardware dependent arbitrary unit. There is also an option to reorient the system to the orientation of the initial configuration (useful for animating Monte Carlo runs).

    For most runs with conversions or with cleanup the program asks the number of additional structures to process in the same way. This allows the processing of a batch of files with one run. For Charmm format output, the segment id of the additional structures will be appended by 1,2, etc., as long as there is 'room' for this additional information in the four-digit sequence id (i.e., the original segment id contained enough blanks).

    Additional input structures may be either in the same file or in separate files - you will be asked which. Each file may contain several structures - you will be asked how many. If they are in separate files, and the first file was called ether A.B or A.1.B then the remaining files have to be called A.2.B, A.3.B, etc. Similar naming convention applies to the output files.

    The output file(s) will contain the modified molecule. Except for runs requesting a transformation of syntax, it will use the same syntax type as the input file. The header comments for Charmm and PDB formats will be extended with a description of the transformation(s) done by the program and/or the results of the optimization.

    Running Simulaid on a batch of files

    Simulaid has been succesfully run from a c-shell script on a batch of files. The script requires a copy of all the answers to the interactive quiz durig the Simulaid run. This can be obtained by running Simulaid and turn on on input logging. It is advisable to set the input predictable in this case. This list has to be put into the script file between the simulaid << fin command and the terminator string fin. In the example below all .pdb files in the directory are converted from PDB atom/residue name convention to Charmm's convention. Also, the generation of the list of files to be processed has to be adapted to each case.

    #! /bin/csh -f
    #
    # script to convert from PDB name convention to Charmm convention
    #
    set files = `ls *.pdb`
    # The variable $file lists all the .pdb files in this directory
    foreach file ($files)
    set file1 = `echo $file | sed s/\.pdb/_ch\.pdb/`
    # Generated a new file name from the priginal .pdb file name to use as output
    echo Reading $file and writing file $file1
    simulaid << fin >> junk
    p
    n
    f
    $file
    b
    $file1
    y
    
    
    9
    3
    
    
    
    fin
    end
    #rm junk
    exit
    

    III. THE OUTPUT ON THE TERMINAL

    The terminal output will include the title cards and the number of atoms found in the input file (the number of hydrogens also if they are not to be considered) for all types of runs.

    • For enclosing sphere calculation:
      • The initial radius found based on the center determined by shifting along the X, Y, and Z axes by (X(min)+X(max))/2, (Y(min)+Y(max))/2, (Z(min)+Z(max))/2, respectively, where X(min), X(max) are the smallest and largest X coordinates, etc.;
      • The optimized radius;
      • The volume of the initial and optimized sphere and the number of waters required to fill it in.
      • The volume of the initial and optimized spheres increased by the water layer width and the number of waters required to fill it in.
      • The vector shifting the molecule to the center of the smallest sphere.
      • If solvents were present in the input file, the number of them remaining that is within the layer of specified width around the optimized sphere.
    • For minimizing the bounding box:
      • The initial bounding box sixe
      • Messages describing the progress of optimization.
      • The smallest bounding boxes found in each minimization try
      • The optimal bounding box sixe
      • The vector shifting the molecule's COM to (0,0,0) and the rotation matrix to the optimal orientation
    • For orientation optimization:
      • The shortest distance between atoms on different images in the original and in the new orientation. The optimization will be performed starting from the input orientation as well as from a user-defined number of different random orientations.
      • Messages describing the progress of optimization.
      • The cell parameter(s), like the edge lengths, corresponding to the cell shape used in the calculation.
      • The cell volume and the number of waters required to fill it in (without the solute present).
      • The vector shifting the molecule's COM to (0,0,0) and the rotation matrix to the optimal orientation
    • For solvent geometry fixing calculations:
      • The number of solvents found.
    • For elimination of waters outside a simulation cell:
      • The number of waters kept and the number of non-water atoms found.
    • The cleanup operations, performed automatically at the end of the run or at special request. Note, that automatically invoked cleanup operations start the residue and atomnumbers from one.
      • The number of atoms found detached from their residue
      • A message if resequencing was or was not needed
    • Conversion output
      • For each type of conversion a variety of warnings or error messages may be obtained indicating that the input file does not conform to some assumptions or the templates used are not prepared for the particular input. These messages should be heeded.
      • For residue-name bearing input a sequence list can be written to a file.
    • The analysis option provides a menu from which a choice should be made and ask if trajectory scan is requested. After the analysis requested is performed the program returns to this menu allowing the user to do an other one or to stop. If additional configurations are read from the input file, the last analysis will be repeated.
      • Each type of analysis prints the name of the output file (generated from the input file name by adding a special suffix) where the requested information is written.
      • Various statistics generated over the lists are also printed on the terminal.
    Furthermore, several internal consistency checks are implemented to increase the chance of detecting any error that might have remained in the program. Failing such a test results in a message preceded by PROGRAM ERROR. This always indicates a bug in the program and should be brought to the attention of the author, together with the input file that was processed when the error occurred.

    IV. FILE FORMATS

    • The name conversion file pdb_nam.dat is of the following syntax:
      • Part 1: nconv,ncol (free format)
        • nconv: Number of name conventions included.
        • ncol: number of data columns in the file.
      • Part 2: 8-character names of the conventions (in one line):
        "Amber   ", "Amber94 ", "Charmm  ", "Moil    ", "IUPAC   ", "Macromod","ChemX   ", " PDB     "
        PDB has to be the last convention.
      • Part 3: Data for residue name conversions. Each line has ncol entries.
        • The first entry is a generic residue name.
        • Entries two to nconv+1 are the actual residue names corresponding to possible variants in the various conventions.
        • Entries from nconv+2 to ncol are possible additional PDB variants. A blank entry means that conversions to that convention of the corresponding atoms will retain the original name. An entry of "****" means that conversions to that convention will result in dropping those atoms.
        • This section is concluded with a line for residue name "DONE".
      • Part 4: Data for atom name conversions. Each line has ncol entries. The first entry is a generic residue name as defined in Part 3. Entries two to nconv+1 are the atom names corresponding to the variants in the various conventions.

        Entries from nconv+2 to ncol are possible additional PDB variants. Entries for PDB varians have to conform to the PDB convention requiring that atom names for elements with single-character chemical id be in the second position - leaving the first one either blank or filled with a number (e.g., "_OD1" or "2HA1").

        There are a number of entries with blank generic residue names - they represent names that convert to each other for any residue. This section is also concluded with a line for residue name "DONE".

    • The template file for the orders of the atoms within each residue has 8-character entries in the form "RESN    ","ATNM    ". The current versions have been directly generated from the RTF files of Amber and Charmm.
    • The file containing the information needed to convert to MMC Monte Carlo input has 8-character entries "RESN    ", "ATNM    ", "PFCD    ", "Charge","GROUPa b" where
      • RESN is the residue name
      • ATNM is the atom name
      • PFCD is the left-adjusted atom-type name
      • Charge is the partial charge given with decimal point
      • GROUPa b gives information about the carous charge groups within a residue. For each group, a=1 for the first atom of the group and a=0 for the rest; b is the sequence number of the group within the residue (i.e., b=1,...,b=2,..., etc).
      Again, they have been extracted from the RTF files of Amber and Charmm, respectively. Note that the atom order template file has the same syntax as the MMC conversion file for the residue and atomnames thus an MMC conversion file can be also used for atom order template file.
    V. INSTALLATION

    The distribution contains the source code and some data files. Installation of the program requires the following steps:

    • Putting these data files into a directory of your choice, referred to as the installation directory.
    • Changing in the block data
      • the string MYPATH to the path of the installation directory
      • the fourth number in the data ldatapaths statement (00) to the number of characters in /MYPATH/
    • Removing the C@** (i.e., deleting the first four characters in the line) comments from lines corresponding to your environment in the subroutine datprt (near the end of the source code) for the printing of the date on output files as follows:
      • "C@UG" --> "" - SGI IRIX Fortran compiler
      • "C@G7" --> "" - g77 Fortran compiler
      • "C@AB" --> "" - Absoft Fortran compiler
      • "C@EF" --> "" - Intel Fortran compiler
      • "C@HP" --> "" - Hewlett-Packard Fortran compiler
      • "C@AX" --> "" - IBM AIX Fortran compiler
    • The subroutine datprt checks if the program is executed on certain hostnames, assumed to be the headnodes of a cluster. The hostnames (e.g., minerva2 in the distribution copy) should be replaced with the actual hostname. This will enable Simulaid to warn the user that longer runs should be sent to a compute node.
    • Compile the program with Fortran, e.g.,:
      f77 -o simulaid.bin simulaid.f
      to obtain the executable simulaid.bin

      Some compilers fail due to a so-called 'relocation error' when optimizing at levels higher than one is asked. When using the Intel Fortran compiler (ifort), adding the compiler directives
      -mcmodel=medium -share_intel
      solved the problem. With some of the other compiler (but not the GNU compiler) the compilation key -fpic was found to solve the problem.

    • If the program is to be compiled with Fortran95 then there is a preprocessor in the distribution f77tof95.f and f77tof95_f95.f (the second one is the Fortran95 version of the first one) that changes the syntax to conform to Fortran95 requirements.
    The distribution directory also contains the files nxyz.frm, sxyz.frm and sxyzrq.frm that give the decription of the data format for some of the InsightII free-formats. The Charmm image files for the non-rectangular boundary conditions, hexy.img, hexz.img, fcc.img, to.img, and hcp.img are also given.

    VI. CHANGING DIMENSIONS

    The sizes of the arrays are established with parameter statements throughout the code. Several symbols user used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some others. Below is a list of these symbols and their relations that have to be maintained (the program checks for violations).

    IMPORTANT: Parameter statements for most symbols occur several places in the program. When a change is required, it has to be carried out at ALL occurences!

    • MAXPHI: maximum number of Delphi gridpoints
        Also, MAXPHI3 is the memory used for 2D-data - see below the parameter definitions depending on MAXPHI
    • MAXREC: Maximum number of records to read
        > maximum number of atoms
        must be divisible by 20
    • MAXRSD: Maximum number of residues
    • MAXNEIG: Maximum number of neighbors of an atom in the neighbor list
        (7+2*MAXNEIG)*MAXREC < MAXPHI3
    • MAXBONDS: Maximum number of bonds (hydrogen, hydrophobic or salt bridge)
    • MAX2D: maximum number of frames for the 2-D RMSD map
        < MAXBONDS
        4*MAXBONDS+MAX2D+2*MAX2D2 < MAXPHI3
    • MAXCV: maximum number of residues for the CV-based domain map
        8*MAXCV+5*MAXCV2 < MAXPHI3 2*MAXRCORR*2+16*MAXRCORR < MAXPHI3
    • MAXCONN: maximum number of residues for the residue distance map
        3*MAXCONN2-11*MAXCONN < MAXPHI3
    The current version uses MAXREC=200000, MAXRSD=70000, MAXNEIG=25, MAXPHI=400, MAXBONDS=10000, MAX2D=5000, MAXCV=2000, MAXRCORR=5000 . Valid values for a larger version (ca. 1.1 Gb) are MAXPHI=640, MAX2D=10000, MAXCV=800, MAXRCORR=10000.

    In additions to the symbols above, there are other size parameters whose value can be chosen independently (the current value is in braces):

    • MAXFRAMES {50000}: the number of trajectory frames to store for analysis
    • MAXCOPY {600}: the number of different data (e.g., torsion angles) per frame to store for analysis
    • MAXHX {50}: the maximum number of residues per protein helix
    • MAXNHX {16}: the maximum number of helices for a single TRAJHELIX run
    • MAXBRDIGEATOM {8000}: Maximum number of bridge anchor atoms:
    • MAXBRDIGETYPE {500}: Maximum number of bridge destination atoms:
    • MAXBRIDGELEN {4}: Maximum number of H bonds in a bridge:
    • MAXDDISTR {200}: Maximum number of atom pairs to calculate the distance distribution
    • MAXDDBIN {20}: The number of distance bins in the distance distributions
    • MAXCDLIST {2000}: The total number of atom pairs that can be involved in distance distribution calculation

    VII. IRIS GL GRAPHICS VERSION

    Earlier versions of Simulaid included optional code for graphics output. Since support for Iris GL stoppped in the early 90s, the graphics code has been removed from Simulaid. The latest version with this code is saved as simulaid_IGL.f and simulaid_IGL.html.

    VIII. REFERENCES

    1. O. Lemke and B.G. Keller, Density-based cluster algorithms for the identification of core sets, J. Chem. Phys.,145, 164104 (2016). DOI:10.1063/1.4965440.

    2. M. Mezei, Optimal Position of the Solute for Simulations, J. Comp. Chem. 18,812-815 (1997). DOI:10.1002/(SICI)1096-987X(19970430)18:6<812::AID-JCC6>3.0.CO;2-V.

    3. M. Mezei, Finding of the smallest enclosing cube to improve molecular modeling, Information Newsletter for Computer Simulation of Condensed Phases, CCP5, Daresbury Lab., No 47, (2000).

    4. J. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky and C. DeLisi, Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins, J. Mol. Biol., 195, 659-685 (1987). DOI:10.1016/0022-2836(87)90189-6.

    5. M. Mezei, A new method for mapping macromolecular topography, J. Molec. Modeling and Simul., 21, 463-472 (2003). DOI:10.1016/S1093-3263(02)00203-6.

    6. Adam D. Schuyler, Heather A. Carlson, and Eva L. Feldman, Computational Methods for Predicting Sites of Functionally Important Dynamics J. Phys. Chem. B,113, 6613-6622 (2009) 10.1021/jp808736c

    7. I. Visiers, B.B. Braunheim, and H. Weinstein, Prokink: a protocol for numerical evaluation of helix distortions by proline. Prot. Engrg., 13, 603-606 (2000). DOI:10.1093/protein/13.9.603

    8. M. Mezei and M. Filizola, TRAJELIX: A computational tool for the geometric characterization of protein helices during molecular dynamics simulations. J. Computer-Aided Molecular Design, 20, 97-107 (2006). DOI:10.1007/s10822-006-9039-1.

    9. D. Cremer and J.A. Pople, General definition of ring puckering coordinates, J. Am. Chem. Soc., 97, 1354 (1975). DOI:10.1021/ja00839a011.

    10. C. Altona and M. Sundaralingam, Conformational analysis of the sugar ring in nucleosides and nucleotides. New description using the concept of pseudorotation J. Am. Chem. Soc., 94, 8205 (1972). DOI:10.1021/ja00778a043.

    11. W. Kabsch and C. Sander, Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637 (1983). DOI:10.1002/bip.360221211.