After breaking down multiple component entries to single components, the next step is to analyze these components in more detail and identify different isoforms/representations of the same molecule to make sure the same compounds get the same ID, and different compounds get different IDs. This requires the detection of isomers.
Part 2: Isomer detection
The level at which isomers need to be handled depends on the purpose of the database, so it is worth deciding which level is best in your case. In short, you have to decide what counts as the same and what counts as different. At mcule, we are building a vendor database comprising many millions of molecules. These molecules were drawn by different individuals following different conventions and using different molecule sketchers, inevitably resulting in different representations of the same molecule. For our database, we have to identify different isoforms of the same molecule and assign them the same ID. We consider tautomers, different protonation states and mesomers the same at a particular stage of the registration process. At this level you have to identify that the structures below are identical:
The registration system at mcule is primarily based on InChI, which is a very powerful molecule identifier that is particularly effective for molecule database registration. While we highly recommend its usage in general, we focus here on issues that the currently available version of InChI (1.03) can't handle perfectly, so that it might need to be combined with other tools to improve the registration protocol. We also want to emphasize that InChI is a great tool, but it is also important to talk about its weaknesses (advantages are well described in the InChI technical manual :)).
One of the purposes of InChI is to identify different tautomeric forms of the same molecule and assign the same identifier to them. This is a significant advantage over the SMILES format (which is rather suitable for recording a particular tautomer and protonation state of a molecule) making InChI preferable for molecule database registration. Most tautomers will get the same InChI due to its comprehensive (prototropic) tautomer recognition rules. But it also has some limitations that should be kept in mind. One such issue mentioned in the InChI technical manual is hard/simple (de)protonation of charged groups. Here is the corresponding part of the InChI technical manual:
"When certain negatively charged heteroatoms are present or additional work is required for complete ‘hard’ proton addition/removal, step 6 discovers additional patterns of exchanging hydrogen atoms and charges. In some compounds, resolving these ambiguities results in an increased ‘mobility’ of H-atoms relative to conventional tautomeric rules.”
In other words, two differently charged compounds (as in the picture above, taken from the InChI technical manual) might get different InChIs. Differentiating between these “hard” and “simple” cases, depending on whether simple or more complex tautomeric rules are required, might make sense in general. For molecule database registration, however, this behaviour is not beneficial, as these molecules should get the same ID. Applying the “hard” tautomeric rules to all molecules should fix this problem (not supported by InChI yet).
It is also important to mention that the latest InChI version (1.03) supports keto-enol and 1,5-tautomerism, although they are not set by default and applying them results in non-standard InChIs. As pointed out by Markus Sitzmann: incorporating these rules (and any others) into the standard InChI would require the introduction of InChI version 2.
One possible way to account for unusual tautomers is to complement the automatic registration with human checks in suspicious cases. This way, not only can ambiguities caused by the hard/simple (de)protonation be recognized, but new tautomeric rules (not coded in InChI) may be identified as well.
Identifying protonation states
As with tautomers, InChI can be used, with some limitations, for identifying different protonation states of molecules as well. InChI follows a standardization approach here by trying to neutralize molecules, but it fails to remove the charge from certain groups. This is due to an incomplete acid/base definition that recognizes only the most commonly used acids/bases (giving rise to other problems, e.g. imperfect salt identification). Some examples are given below:
As you can see above, the neutral and the charged forms have different InChIs. The main problem is that the above InChIs differ not only in their /q layer, but also in the hydrogen count of the chemical formula. This means that:
i) molecules in different protonation states will get different chemical formulas, which conflicts with the InChI technical manual:
“d. Charges and Protons
The charge and proton layer are independent of all others and may be simply removed to eliminate dependence on charge or degree of protonation or deprotonation. Therefore, in a comparison to find identical compounds, do not consider these layers in a comparison unless you wish to distinguish different charge and protonation states.”
ii) in these examples it appears that the chemical formula depends on the SUBSEQUENT charge (/q) layer, which also conflicts with the InChI technical manual:
“…On the other hand, layers do not depend on successive layers.”
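The rule quoted under i) can be tried out with a few lines of Python. This is only a sketch, not part of the InChI software: the helper names (`strip_charge_layers`, `formula_layer`, `same_ignoring_charge`) are our own, and the code treats a standard InChI as a plain string whose layers are separated by `/`. The acetic acid/acetate pair is used because InChI handles it the way the manual describes.

```python
def strip_charge_layers(inchi: str) -> str:
    """Return the InChI without its /q (charge) and /p (proton) layers."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(("q", "p"))]
    return "/".join([head, *kept])


def formula_layer(inchi: str) -> str:
    """Return the chemical formula layer, or '' if the InChI has none."""
    parts = inchi.split("/")
    # The formula starts with an element symbol or a multiplier digit,
    # while every other layer starts with a lowercase prefix letter.
    if len(parts) > 1 and (parts[1][:1].isupper() or parts[1][:1].isdigit()):
        return parts[1]
    return ""


def same_ignoring_charge(a: str, b: str) -> bool:
    """Compare two InChIs after removing the charge and proton layers."""
    return strip_charge_layers(a) == strip_charge_layers(b)


acetic_acid = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
acetate = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"

print(same_ignoring_charge(acetic_acid, acetate))  # True
print(formula_layer(acetate))                      # C2H4O2
```

For pairs like the ones in the examples above, the formula layer itself differs between the two protonation states, so the stripped strings still differ and this comparison alone would not merge them.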
Some of these issues are quite serious, I think. For example, tetrazoles occur quite frequently in molecule databases, and they are typically negatively charged at pH=7. Depending on the preference of the person drawing the structure (neutral form or physiologically relevant protonation state), the same compound can get different InChIs.
I quickly checked ChemSpider and eMolecules databases and they also handle the two protonation states as two different molecules. Here are the two links for ChemSpider:
http://www.chemspider.com/RecordView.aspx?rid=986071e3-04c9-4b1f-8eff-a476d679a2b5 (negatively charged)
A solution to the above problem would be to improve the acid definition used for neutralization in InChI, although this will probably only happen in the next InChI version.
For a more complete identification of protonation states, it might make sense to combine InChI with a "neutralizer" that works at least for the examples above. We follow this strategy during molecule registration in mcule.
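A real neutralizer needs a cheminformatics toolkit and is not shown here. As a lightweight cross-check next to it, one could flag InChI pairs that might be protonation states of the same compound. The sketch below is our own illustration, not the mcule implementation: adding or removing a proton changes the hydrogen count and the net charge by the same amount, so the hydrogen count of the formula minus the /q charge stays constant. The function names are hypothetical, and the code assumes single-component standard InChIs whose /q layer holds one integer.

```python
import re
from collections import defaultdict

_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def protonation_key(inchi: str) -> tuple[str, int]:
    """Key that is unchanged when H+ is added to or removed from a molecule."""
    parts = inchi.split("/")
    heavy, hydrogens = [], 0
    for element, count in _ELEMENT.findall(parts[1]):
        n = int(count) if count else 1
        if element == "H":
            hydrogens += n
        else:
            heavy.append(f"{element}{n}")
    charge = 0
    for layer in parts[2:]:
        if layer.startswith("q"):
            charge = int(layer[1:])  # "+1" -> 1, "-1" -> -1
            break
    return "".join(heavy), hydrogens - charge


def protonation_candidates(inchis: list[str]) -> list[list[str]]:
    """Group InChIs by key; keep groups whose formula layers disagree."""
    groups = defaultdict(set)
    for inchi in inchis:
        groups[protonation_key(inchi)].add(inchi)
    return [
        sorted(group)
        for group in groups.values()
        if len({inchi.split("/")[1] for inchi in group}) > 1
    ]


nitric_acid = "InChI=1S/HNO3/c2-1(3)4/h(H,2,3,4)"
nitrate = "InChI=1S/NO3/c2-1(3)4/q-1"
acetic_acid = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
acetate = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"

for group in protonation_candidates([nitric_acid, nitrate, acetic_acid, acetate]):
    print(group)
```

Only the nitric acid/nitrate pair is reported here, because the acetic acid/acetate pair already shares one formula layer. The key is deliberately coarse (isomers with different hydrogen counts and charges can also collide), so the output is a list for review, not a list of confirmed duplicates.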
From the point of view of molecule registration, mesomeric representations of the same molecule should be identified and given the same ID.
Some of the most frequently occurring dipolar mesomeric examples can be found here; among them, the nitro and azide groups are shown on the left.
InChI accounts for a comprehensive list of dipolar mesomers, normalizes them and generates the same InChI for each representation. We haven't seen InChI problems with such mesomers so far, but suspicious cases (molecules with the same chemical formula) might be worth monitoring during registration.
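Monitoring molecules with the same chemical formula can be done at registration time. The following is a minimal sketch under our own assumptions: `FormulaMonitor` is a hypothetical in-memory helper, and a real registration system would keep this index in the database instead.

```python
from collections import defaultdict


class FormulaMonitor:
    """Remember registered InChIs and report same-formula neighbours."""

    def __init__(self) -> None:
        self._by_formula = defaultdict(set)

    def register(self, inchi: str) -> list[str]:
        """Store inchi; return earlier, different InChIs with the same formula."""
        formula = inchi.split("/")[1]
        earlier = sorted(self._by_formula[formula] - {inchi})
        self._by_formula[formula].add(inchi)
        return earlier


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
dimethyl_ether = "InChI=1S/C2H6O/c1-3-2/h1-2H3"

monitor = FormulaMonitor()
print(monitor.register(ethanol))         # []
print(monitor.register(dimethyl_ether))  # the ethanol InChI is reported
```

Most hits will be ordinary isomers, like the ethanol/dimethyl ether pair here, so the returned list is only a queue for a closer look.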
We would like to emphasize that tautomerism, protonation and mesomerism were only considered from the point of view of molecule registration here. For some searching purposes other considerations/tools should be applied to get the expected results (a topic for another post).
As before, comments on the above problems or others not mentioned in this post are welcome! Solutions (free or commercial) are highly appreciated too!