Description of the program SIMULAID: a collection of utilities designed to help setting up and analyze molecular simulations.

Mihaly Mezei

Department of Pharmacological Sciences,
Icahn Sinai School of Medicine at Mount Sinai,
New York, NY 100102

Mihaly.Mezei@mssm.edu

Jan. 12, 2017.

Reference: M. Mezei, Simulaid: a simulation facilitator and analysis program, J. Comp. Chem., 31, 2658-2668 (2010). DOI:10.1002/jcc.21551

 
 
0.  GENERAL INTRODUCTION TO SIMULATION SETUPS
I.  DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM
II.  RUNNING THE PROGRAM - INPUT
III.  THE OUTPUT ON THE TERMINAL
IV.  FILE FORMATS
V.  INSTALLATION
VI.  CHANGING DIMENSION
VII.  IRIS GL GRAPHICS VERSION
VIII.  REFERENCES

0. GENERAL INTRODUCTION TO SIMULATION SETUPS

In general, a simulation requires three different types of information:

For analysis, the may also be needed.

Simulaid will help in various aspects of establishing these. In order to help clarify the various functionalities of Simulaid, a brief description is given of the various components of these descriptions. Frequent reference will be made to the PDB (Protein Data Bank) format that is considered the 'lingua franca' of molecular simulations (although it has a number of 'dialects') and to formats used by the simulation packages Amber, Charmm, Gromos/Gromacs, Discover, Macromodel, and MMC.

0.1. Description of the interaction energy terms

Most simulation packages rely on the concept of atomtype for the specification of the interaction potential, supplemented by a partial charge. It is generally assumed that force constants, dispersion and exhange repulsion terms are transferable to a larger extent than the partial charges.

The atomtypes can be defined by alphanumeric labels (Amber, Charmm, Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS) or both (MMC). In most cases, a parameter file contains the data necessary

0. GENERAL INTRODUCTION TO SIMULATION SETUPS

In general, a simulation requires three different types of information:

For analysis, the may also be needed.

Simulaid will help in various aspects of establishing these. In order to help clarify the various functionalities of Simulaid, a brief description is given of the various components of these descriptions. Frequent reference will be made to the PDB (Protein Data Bank) format that is considered the 'lingua franca' of molecular simulations (although it has a number of 'dialects') and to formats used by the simulation packages Amber, Charmm, Gromos/Gromacs, Discover, Macromodel, and MMC.

0.1. Description of the interaction energy terms

Most simulation packages rely on the concept of atomtype for the specification of the interaction potential, supplemented by a partial charge. It is generally assumed that force constants, dispersion and exhange repulsion terms are transferable to a larger extent than the partial charges.

The atomtypes can be defined by alphanumeric labels (Amber, Charmm, Gromos/Gromacs, Discover) or simple integers (Macromodel, OPLS) or both (MMC). In most cases, a parameter file contains the data necessary to extract the various coefficients to be used in the energy expressions (Amber, Charmm, Gromos/Gromacs, Discover) but they may be built into the program source code (Macromodel) or both (MMC).

0.2. Description of the molecules in the system

Most simulation packages use a hierarchical approach to define the assembly of atoms to be simulated.

Simulaid can read molecular structures in the following formats:

Depending on the input format used, Simulaid can also write in most of the formats listed above. It can not write a .mae file and it can only write a .mol2 file if the input was also in .mol2 format.

Since residues are considered frequently occuring molecular units, several packages (Amber, Charmm, Discover) use a so-called residue topology file (RTF) for their definition. For amino and nucleic acids the PDB nomenclature is followed, but only approximately. This standard includes definitions for amino acid and nucleic acid residues. These definitions consist of a three-letter residue name and atomnames that can be up to four characters long.

Simulaid can read topology files in the following formats:

To arrive at the fully defined system, most packages rely on an annotated coordinate file containing the Cartesian coordinates of the atoms in the system as well as on the RTF and parameter files. For Amber, the coordinate file is in the PDB format while for Charmm it is in the so-called CRD format. Once the list of residues and atoms are known to Amber or Charmm, they will turn to their RTF file where the combination of residue names and atomnames will provide for each atom their atomtypes and partial charges. The .car file of Discover, the .dat file of Macromodel, the .mol2 file of Tripos, as well as the .slt file of MMC, however will contain not only the residue names and atomnames, but the atom types and partial charges as well.

The record formats of the formatted input records can be listed by Simulaid.

The RTF file of Amber and Charmm also specifies the bonding pattern. Macromodel and Tripos, on the other hand, require the bond list in the coordinate file while MMC will deduce the bonding pattern from the coordinates themselves.

Segments are usually defined by the input stream.

0.3. Description of the boundary conditions.

The simulated system may be a droplet in vacuum or a unit cell of a periodic system. Each program has to be explicitly instructed in the input stream as to the size and shape of the periodic cell - the options vary from package to package. Charmm uses an additional file of the type .img that specifies the centers of all surrounding cells, but most packages use built-in routines for the different cell types allowed.

0.4. Trajectories. Simulaid can read trajectories in the following formats:

If a trajectory is in multiple files then the file names should be derived from the name of the first segment. Assuming that the trajectory starts with file x.dcd, subsequent segments should be called x_2.dcd, x_3.dcd, etc.

NOTE on binary files: it is important to compile Simulad with the same (type) of compilers as the program that generated the trajectory was compiled with. Binary files generated with Fortran-77 and Fortran-90 are different; similarly, they can be different for 32-bit and 64-bit systems.

0.5. Clustering options. Simulaid has implemented several clustering algorithms.

Single-link and maximum neighbor based clustering can be run either by specifying a distance cutoff, resulting in an unspecified number of clusters or by specifying the number of clusters in which case the cutoff is udjusted until this number of cluster results.

Furthermore, single-link clustering can be used to do a hierarchical clustering using several different distance cutoffs. In this case clusters obtained with a larger cutoff are further clustered with the next smaller cutoff.

I. DESCRIPTION OF THE FUNCTIONS OF THE PROGRAM

I.0. Online help: Any prompt containing a '?' within the bracketed field (e.g., 'Do you want ... [y/n/$] ?' or 'Number of ... [100,$]=') will respond with an explanation if the user types just a '?'. Some of the prompts asking for a selection from a list also has a help option that can be activated by typing a '?'. In each case, the prompt is repeated after the explanation has been printed.

I.1. Operations performed by several functions:

For translation/rotation and trajectory conversions Simulaid offers the option to reset the system into the periodic cell. The use of periodic boundary conditions may be restricted to two or one dimensions; i.e., the resetting is applied only in a selected coordinae plane or along a selected axis.

This option also selects periodic images of each solute molecule in such a way that the solute appears as compact as possible. Since Simulaid considers each solute segment a single molecule, this may result in moving groups of molecules (e.g., lipids) always together. Thus, for functions that may involve resetting the system into a PBC cell, Simulaid offers the users the option to define a number of residues as molecular residues, i.e., groups of atoms that are to be reset into the PBC cell together but independent of the rest of the solute. Furthermore, residues can also be defined as ions - these are also treated individually but they are not moved around only to keep them in the simulation cell, but not in an attempt to make the solute more compact,

I.2. Specific operations performed by Simulaid:

II.RUNNING THE PROGRAM - INPUT

The program is run interactively: the user is first asked to select the operation to be performed then (if needed) the

  • Creation of a Grasp .crg file involves specifying the RTF file name(s). Note that the Grasp format limits atom names to four characters and residue names to three characters. If either of these limits is exceded, the program will truncate the names and print a warning. Also, the program requires a valid 'input' filename, but it will be ignored. Each solute segment/chain is colored differently (using six colors, cyclicly repeated: 1:red; 2:green; 3:yellow; 4:blue; 5:pink; 6; cyan), solvents are colored white. For simulations with variable box size the smallest and largest PBC boxes are also drawn. For multisegment solutes one segment can be highlighted. After each pass option is given to repeat the animation and change the value for the length of waiting time between showing two frames. Each time the animation pauses the display can also be manipulated by keyboard commands. The length of wait is defined in a hardware dependent arbitrary unit. There is also an option to reorient the system to the orientation of the initial configuration (useful for animating Monte Carlo runs).

    For most runs with conversions or with cleanup the program asks the number of additional structures to process in the same way. This allows the processing of a batch of files with one run. For Charmm format output, the segment id of the additional structures will be appended by 1,2, etc., as long as there is 'room' for this additional information in the four-digit sequence id (i.e., the original segment id contained enough blanks).

    Additional input structures may be either in the same file or in separate files - you will be asked which. Each file may contain several structures - you will be asked how many. If they are in separate files, and the first file was called ether A.B or A.1.B then the remaining files have to be called A.2.B, A.3.B, etc. Similar naming convention applies to the output files.

    The output file(s) will contain the modified molecule. Except for runs requesting a transformation of syntax, it will use the same syntax type as the input file. The header comments for Charmm and PDB formats will be extended with a description of the transformation(s) done by the program and/or the results of the optimization.

    Running Simulaid on a batch of files

    Simulaid has been succesfully run from a c-shell script on a batch of files. The script requires a copy of all the answers to the interactive quiz durig the Simulaid run. This can be obtained by running Simulaid and turn on on input logging. It is advisable to set the input predictable in this case. This list has to be put into the script file between the simulaid << fin command and the terminator string fin. In the example below all .pdb files in the directory are converted from PDB atom/residue name convention to Charmm's convention. Also, the generation of the list of files to be processed has to be adapted to each case.

    #! /bin/csh -f
    #
    # script to convert from PDB name convention to Charmm convention
    #
    set files = `ls *.pdb`
    # The variable $file lists all the .pdb files in this directory
    foreach file ($files)
    set file1 = `echo $file | sed s/\.pdb/_ch\.pdb/`
    # Generated a new file name from the priginal .pdb file name to use as output
    echo Reading $file and writing file $file1
    simulaid << fin >> junk
    p
    n
    f
    $file
    b
    $file1
    y
    
    
    9
    3
    
    
    
    fin
    end
    #rm junk
    exit
    

    III. THE OUTPUT ON THE TERMINAL

    The terminal output will include the title cards and the number of atoms found in the input file (the number of hydrogens also if they are not to be considered) for all types of runs.

    Furthermore, several internal consistency checks are implemented to increase the chance of detecting any error that might have remained in the program. Failing such a test results in a message preceded by PROGRAM ERROR. This always indicates a bug in the program and should be brought to the attention of the author, together with the input file that was processed when the error occurred.

    IV. FILE FORMATS

    V. INSTALLATION

    The distribution contains the source code and some data files. Installation of the program requires the following steps:

    The distribution directory also contains the files nxyz.frm, sxyz.frm and sxyzrq.frm that give the decription of the data format for some of the InsightII free-formats. The Charmm image files for the non-rectangular boundary conditions, hexy.img, hexz.img, fcc.img, to.img, and hcp.img are also given.

    VI. CHANGING DIMENSIONS

    The sizes of the arrays are established with parameter statements throughout the code. Several symbols user used for this purpose. There are certain relations between these symbols, so changing one of them is likely to require changes in some others. Below is a list of these symbols and their relations that have to be maintained (the program checks for violations).

    IMPORTANT: Parameter statements for most symbols occur several places in the program. When a change is required, it has to be carried out at ALL occurences!

    The current version uses MAXREC=200000, MAXRSD=70000, MAXNEIG=25, MAXPHI=400, MAXBONDS=10000, MAX2D=5000, MAXCV=2000, MAXRCORR=5000 . Valid values for a larger version (ca. 1.1 Gb) are MAXPHI=640, MAX2D=10000, MAXCV=800, MAXRCORR=10000.

    In additions to the symbols above, there are other size parameters whose value can be chosen independently (the current value is in braces):

    VII. IRIS GL GRAPHICS VERSION

    Earlier versions of Simulaid included optional code for graphics output. Since support for Iris GL stoppped in the early 90s, the graphics code has been removed from Simulaid. The latest version with this code is saved as simulaid_IGL.f and simulaid_IGL.html.

    VIII. REFERENCES

    1. O. Lemke and B.G. Keller, Density-based cluster algorithms for the identification of core sets, J. Chem. Phys.,145, 164104 (2016). DOI:10.1063/1.4965440.

    2. M. Mezei, Optimal Position of the Solute for Simulations, J. Comp. Chem. 18,812-815 (1997). DOI:10.1002/(SICI)1096-987X(19970430)18:6<812::AID-JCC6>3.0.CO;2-V.

    3. M. Mezei, Finding of the smallest enclosing cube to improve molecular modeling, Information Newsletter for Computer Simulation of Condensed Phases, CCP5, Daresbury Lab., No 47, (2000).

    4. J. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky and C. DeLisi, Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins, J. Mol. Biol., 195, 659-685 (1987). DOI:10.1016/0022-2836(87)90189-6.

    5. M. Mezei, A new method for mapping macromolecular topography, J. Molec. Modeling and Simul., 21, 463-472 (2003). DOI:10.1016/S1093-3263(02)00203-6.

    6. Adam D. Schuyler, Heather A. Carlson, and Eva L. Feldman, Computational Methods for Predicting Sites of Functionally Important Dynamics J. Phys. Chem. B,113, 6613-6622 (2009) 10.1021/jp808736c

    7. I. Visiers, B.B. Braunheim, and H. Weinstein, Prokink: a protocol for numerical evaluation of helix distortions by proline. Prot. Engrg., 13, 603-606 (2000). DOI:10.1093/protein/13.9.603

    8. M. Mezei and M. Filizola, TRAJELIX: A computational tool for the geometric characterization of protein helices during molecular dynamics simulations. J. Computer-Aided Molecular Design, 20, 97-107 (2006). DOI:10.1007/s10822-006-9039-1.

    9. D. Cremer and J.A. Pople, General definition of ring puckering coordinates, J. Am. Chem. Soc., 97, 1354 (1975). DOI:10.1021/ja00839a011.

    10. C. Altona and M. Sundaralingam, Conformational analysis of the sugar ring in nucleosides and nucleotides. New description using the concept of pseudorotation J. Am. Chem. Soc., 94, 8205 (1972). DOI:10.1021/ja00778a043.

    11. W. Kabsch and C. Sander, Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637 (1983). DOI:10.1002/bip.360221211.