"SMILES" is an acronym for Simplified Molecular Input Line Entry System. SMILES was developed to allow the unambiguous specification of a chemical structure as a text string. Other string-based molecular notations exist, including WLN, SLN, and InChI. SMILES strings are more easily read by a human than some of these alternatives.
An explanation of SMILES notation, with some examples, follows:
Atoms:
Atoms are represented by the standard symbols of the chemical elements, in square brackets, such as [Au] for gold. Inside brackets, attached hydrogens and any charge are written explicitly, so the hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance, the SMILES for water may be written simply as O.
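The bracket rule above can be illustrated with a small tokenizer. This is a minimal sketch using only Python's standard library, not a full SMILES parser: it recognizes bracket atoms and the aliphatic organic subset only, and the names ORGANIC_SUBSET, atom_tokens and needs_brackets are invented for this illustration.

```python
import re

# Organic-subset symbols that may be written without brackets. The two-letter
# symbols come first so that "Cl" is not read as "C" followed by a stray "l".
ORGANIC_SUBSET = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")

# A bracket atom is everything from "[" up to the next "]".
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|" + "|".join(ORGANIC_SUBSET))


def atom_tokens(smiles):
    """Return the atom tokens of a SMILES string; bonds, digits and parentheses are skipped."""
    return ATOM_PATTERN.findall(smiles)


def needs_brackets(symbol):
    """True if an element symbol lies outside the organic subset."""
    return symbol not in ORGANIC_SUBSET


print(atom_tokens("CCO"))       # three unbracketed atoms
print(atom_tokens("[OH-]"))     # one bracket atom
print(needs_brackets("Au"))     # gold is not in the organic subset
```

Lower-case aromatic symbols are deliberately left out of this sketch, so it should only be applied to aliphatic strings.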
Bonds:
Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example, the SMILES for ethanol can be written as CCO. Ring closure labels indicate connectivity between atoms that are not adjacent in the string: cyclohexane and 1,4-dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively, as illustrated by the SMILES O=C=O (carbon dioxide) and C#N (hydrogen cyanide).
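The bond rules above, together with the implicit-hydrogen rule, can be sketched in a few lines of Python. The sketch assumes an unbranched string made only of organic-subset atoms, the bond symbols '-', '=' and '#', and single-digit ring labels; the VALENCE table holds only the lowest normal valence of each element, and the function names are invented for this illustration.

```python
# Lowest normal valence of each organic-subset element (higher valences of
# N, P and S are ignored in this sketch).
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_unbranched(smiles):
    """Return (atoms, bonds) for an unbranched, bracket-free aliphatic SMILES.

    atoms is a list of element symbols; bonds is a list of (i, j, order)
    tuples over atom indices.
    """
    atoms, bonds = [], []
    open_rings = {}   # ring label -> index of the atom that opened it
    pending = 1       # order of the next bond; single unless a symbol says otherwise
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the label closes the ring.
                bonds.append((open_rings.pop(ch), len(atoms) - 1, pending))
            else:
                # First occurrence opens it; a bond symbol written before an
                # opening digit is not remembered in this sketch.
                open_rings[ch] = len(atoms) - 1
            pending = 1
        else:
            symbol = smiles[i:i + 2] if smiles[i:i + 2] in VALENCE else ch
            if symbol not in VALENCE:
                raise ValueError(f"unsupported character {ch!r}")
            if atoms:
                # Adjacency in the string implies a bond to the previous atom.
                bonds.append((len(atoms) - 1, len(atoms), pending))
            atoms.append(symbol)
            pending = 1
            i += len(symbol) - 1
        i += 1
    return atoms, bonds


def implicit_hydrogens(atoms, bonds):
    """Hydrogens needed to bring each atom up to its default valence."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [VALENCE[symbol] - u for symbol, u in zip(atoms, used)]


for example in ("CCO", "C1CCCCC1", "O=C=O", "C#N"):
    print(example, implicit_hydrogens(*parse_unbranched(example)))
```

For ethanol this gives three, two and one hydrogens on the successive atoms, matching CH3-CH2-OH.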
Branching:
Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode 3-cyanoanisole and 4-cyanoanisole respectively. Writing SMILES for substituted rings in this way can make them more human-readable.
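One way to see how parentheses work is to keep a stack of "return to this atom" markers while reading the string. The sketch below, which assumes a bracket-free SMILES with single-digit ring labels, builds a connectivity table that way and then checks the two cyanoanisole strings by counting bonds around the ring. All names (neighbors, distance, ring_position_of_nitrile) are invented for this illustration.

```python
import re
from collections import deque

TOKEN = re.compile(r"Cl|Br|[BCNOPSFIbcnops]|[-=#:]|[()]|\d")


def neighbors(smiles):
    """Return (symbols, adjacency) for a bracket-free SMILES string."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES feature")
    symbols, adjacency = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring label -> atom that opened it
    previous = None     # atom that the next atom will bond to
    for token in tokens:
        if token == "(":
            branch_stack.append(previous)
        elif token == ")":
            previous = branch_stack.pop()
        elif token.isdigit():
            if token in open_rings:
                other = open_rings.pop(token)
                adjacency[other].append(previous)
                adjacency[previous].append(other)
            else:
                open_rings[token] = previous
        elif token in "-=#:":
            continue    # bond order does not change connectivity
        else:
            symbols.append(token)
            adjacency.append([])
            current = len(symbols) - 1
            if previous is not None:
                adjacency[previous].append(current)
                adjacency[current].append(previous)
            previous = current
    return symbols, adjacency


def distance(adjacency, start, goal):
    """Number of bonds on the shortest path between two atoms (breadth-first search)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == goal:
            return seen[atom]
        for nxt in adjacency[atom]:
            if nxt not in seen:
                seen[nxt] = seen[atom] + 1
                queue.append(nxt)
    raise ValueError("atoms are not connected")


def ring_position_of_nitrile(smiles):
    """Ring position of the C#N group, counting the methoxy-bearing carbon as 1."""
    symbols, adjacency = neighbors(smiles)
    oxygen = symbols.index("O")
    nitrogen = symbols.index("N")
    # Each substituent hangs off the aromatic neighbor of its attachment atom.
    ring_o = next(a for a in adjacency[oxygen] if symbols[a] == "c")
    nitrile_c = adjacency[nitrogen][0]
    ring_cn = next(a for a in adjacency[nitrile_c] if symbols[a] == "c")
    return distance(adjacency, ring_o, ring_cn) + 1


print(ring_position_of_nitrile("COc(c1)cccc1C#N"))
print(ring_position_of_nitrile("COc(cc1)ccc1C#N"))
```

The helper ring_position_of_nitrile is specific to these two molecules; the general part is the stack handling in neighbors.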
Aromaticity:
Aromatic C, O, S and N atoms are written in lower case as 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic, although they can be specified explicitly using the ':' symbol. Aromatic atoms can also be joined by a single bond, which must then be written explicitly as '-'; biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH], and imidazole is written in SMILES notation as n1c[nH]cc1.
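The reason pyrrole-type nitrogen needs [nH] becomes clearer when hydrogens are counted. The sketch below assumes a simplified rule that holds for the examples in this section: an unbracketed aromatic 'c' carries 3 minus its number of bonds as hydrogens, an unbracketed 'n', 'o' or 's' carries none, and [nH] carries exactly one. It handles unbranched strings only, and the function name aromatic_formula is invented for this illustration.

```python
import re
from collections import Counter

# Only the tokens used in this section's examples: aromatic c, n, o, s,
# the bracket atom [nH], ring-closure digits, and explicit '-' or ':' bonds.
TOKEN = re.compile(r"\[nH\]|[cnos]|[-:]|\d")


def aromatic_formula(smiles):
    """Return element counts for an unbranched aromatic SMILES as a dict."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES feature")
    atoms = []                     # one [token, number_of_bonds] pair per atom
    for token in tokens:
        if token in "-:":
            continue               # bond type does not change the neighbor count
        if token.isdigit():
            atoms[-1][1] += 1      # each ring-closure digit is one more bond
            continue
        if atoms:
            atoms[-1][1] += 1      # bond from the previous atom to this one
        atoms.append([token, 1 if atoms else 0])
    counts = Counter()
    for token, degree in atoms:
        if token == "[nH]":
            counts["N"] += 1
            counts["H"] += 1       # the hydrogen written inside the brackets
        elif token == "c":
            counts["C"] += 1
            counts["H"] += 3 - degree
        else:
            counts[token.upper()] += 1   # unbracketed n, o, s: no hydrogens
    return dict(counts)


for example in ("c1ccccc1", "n1ccccc1", "o1cccc1", "n1c[nH]cc1", "c1ccccc1-c2ccccc2"):
    print(example, aromatic_formula(example))
```

Under these assumptions, writing imidazole without brackets as n1cncc1 would lose one hydrogen, which is why the [nH] form is required.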
Stereochemistry:
Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is one representation of trans-1,2-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-1,2-difluoroethene, in which the fluorine atoms are on the same side of the double bond, as shown in the figure.
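For the simplest strings of the form X/C=C/Y, the rule reduces to comparing the two direction symbols: matching symbols mean trans, differing symbols mean cis. The sketch below covers only that restricted pattern, with one marked substituent written before the double bond and one after it; the names SIMPLE_ALKENE and double_bond_geometry are invented for this illustration.

```python
import re

# Matches only strings such as F/C=C/F: one substituent, a direction symbol,
# C=C, a direction symbol, one substituent. Anything else is rejected.
SIMPLE_ALKENE = re.compile(r"^[A-Za-z]+([/\\])C=C([/\\])[A-Za-z]+$")


def double_bond_geometry(smiles):
    """Return 'cis' or 'trans' for a SMILES of the form X/C=C/Y."""
    match = SIMPLE_ALKENE.match(smiles)
    if match is None:
        raise ValueError("only strings of the form X/C=C/Y are handled")
    first, second = match.groups()
    return "trans" if first == second else "cis"


print(double_bond_geometry("F/C=C/F"))
# The backslash is doubled only because of Python string escaping.
print(double_bond_geometry("F/C=C\\F"))
```

Because only the relationship between the two symbols matters, F\C=C\F describes the same trans isomer as F/C=C/F.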
Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O. The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the substituents hydrogen (H), methyl (C) and carboxyl (C(=O)O) appear in clockwise order. D-Alanine can be written as N[C@H](C)C(=O)O. The order of the substituents in the SMILES string matters: swapping two of them inverts the meaning of the specifier, so D-alanine can also be encoded as N[C@@H](C(=O)O)C.
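The dependence on substituent order can be checked with a parity argument: exchanging two neighbors of the chiral center flips the configuration, so two descriptions agree when the specifiers match and the neighbor orders differ by an even permutation, or when the specifiers differ and the permutation is odd. The sketch below works on hand-written neighbor lists rather than parsing SMILES, and the labels "CH3" and "COOH" and both function names are invented for this illustration.

```python
def permutation_is_odd(order_a, order_b):
    """True if order_b is reached from order_a by an odd number of swaps."""
    positions = [order_a.index(item) for item in order_b]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2 == 1


def same_configuration(tag_a, order_a, tag_b, order_b):
    """Compare two tetrahedral centers, each given as '@' or '@@' plus its neighbor order."""
    flipped = permutation_is_odd(order_a, order_b)
    return (tag_a == tag_b) != flipped


# Neighbor order as read from each string: preceding atom, bracket hydrogen,
# then the following substituents in the order written.
l_alanine = ("@@", ["N", "H", "CH3", "COOH"])     # N[C@@H](C)C(=O)O
d_alanine = ("@", ["N", "H", "CH3", "COOH"])      # N[C@H](C)C(=O)O
d_alanine_alt = ("@@", ["N", "H", "COOH", "CH3"])  # N[C@@H](C(=O)O)C

print(same_configuration(*d_alanine, *d_alanine_alt))
print(same_configuration(*l_alanine, *d_alanine_alt))
```

The first comparison reports the same configuration and the second a different one, in line with the two D-alanine encodings given above.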
Isotopes:
Isotopes are specified by writing the integer isotopic mass immediately before the atomic symbol, inside square brackets. Benzene in which one carbon atom is carbon-14 is written as [14c]1ccccc1, and deuterochloroform is [2H]C(Cl)(Cl)Cl.
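The fields that can appear inside a bracket atom in the examples so far (isotope, symbol, chirality, hydrogen count, charge) can be pulled apart with a regular expression. This is a sketch under simplifying assumptions: it does not check the symbol against the periodic table, it accepts charges written as '+', '-', '+2' and so on but not '++', and the names BRACKET_ATOM and parse_bracket_atom are invented for this illustration.

```python
import re

BRACKET_ATOM = re.compile(
    r"^\["
    r"(?P<isotope>\d+)?"                       # optional mass number, e.g. 14
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops])"  # element symbol, lower case if aromatic
    r"(?P<chirality>@@?)?"                     # optional @ or @@
    r"(?P<hcount>H\d*)?"                       # optional attached hydrogens, e.g. H or H2
    r"(?P<charge>[+-]\d*)?"                    # optional charge, e.g. - or +2
    r"\]$"
)


def parse_bracket_atom(token):
    """Split a bracket atom such as '[14c]' or '[OH-]' into its fields."""
    match = BRACKET_ATOM.match(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")
    hcount = match["hcount"]
    charge = match["charge"]
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "aromatic": match["symbol"].islower(),
        "chirality": match["chirality"],
        # A bare "H" means one hydrogen; a bare "+" or "-" means a charge of one.
        "hydrogens": 0 if hcount is None else int(hcount[1:] or 1),
        "charge": 0 if charge is None else int(charge[0] + (charge[1:] or "1")),
    }


for token in ("[14c]", "[2H]", "[OH-]", "[nH]", "[C@@H]"):
    print(token, parse_bracket_atom(token))
```

Note that in [2H] the H is the element symbol itself, not a hydrogen count, which the pattern handles by matching the symbol before the optional hydrogen field.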
Other Examples:
CC(=O)Oc1ccccc1C(O)=O is acetylsalicylic acid (aspirin).