Mcule: 2011

Thursday, November 17, 2011

mcule at GCC

We just returned from the 2011 GCC conference, which was held in Goslar, Germany. The atmosphere, the talks and the discussions were all very inspiring; many thanks to the organizers! Now we are back at the mcule HQ with lots of new ideas!

mcule poster

We are now (even more than before) convinced that mcule is going to be a useful service for all those interested in doing rational drug discovery. It was great to see that several people had already heard about the things we are working on. We had discussions with modeling software developers about integrating their tools into mcule, and also with molecule vendors about making their compounds available in the mcule vendor database.

GCC was also a great opportunity to meet our potential users and we got very useful feedback about their needs, which we'll take into account during development.

BTW, we are working hard to make the first release available for testing this year to those who give their email address on the launch page. So keep checking your inbox!

You may take a closer look at the poster here:

Wednesday, September 7, 2011

Double bond stereochemistry? Not yet solved...

Representing stereochemistry in 2D is not a trivial task. Otherwise, this document wouldn't be that long. :) There are (in my opinion) too many conventions, which makes large-scale, automatic recognition a nightmare. Even though some people claim that cheminformatics is solved, here is a problem for those who think we are not quite done yet and that some issues are still left to be solved:

How to indicate double bond stereochemistry and how to store this information in MDL MOL files (v2000) correctly?

Even for a single tetrahedral stereocentre there are numerous ways to represent stereochemical information in 2D. Nevertheless, the following bond types are generally used for the most typical cases:

Wedge bonds can indicate a single isomer or a racemic mixture. In drawings, they are typically distinguished by "ABS" or "absolute" labels in case of single isomers and "RAC" or "racemic" labels in case of racemic mixtures.

In v2000 MOL files, all these bond types have respective bond stereo types. According to the MOL file spec the stereochemical information should be stored in the bond block rather than in the atom block, which means that storing the stereo bond types is sufficient. Single isomers with absolute configuration can be distinguished from racemic mixtures by turning on the chiral flag.
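To make the storage concrete, here is a minimal sketch of reading the chiral flag and the bond stereo field from a v2000 MOL block using the fixed-width columns of the CTfile format. The function name `read_bond_stereo` and the human-readable labels are my own choices for illustration, and the sketch assumes a well-formed block with exactly three header lines.

```python
# Meaning of the bond stereo field in a v2000 bond block.
SINGLE_BOND_STEREO = {0: "not stereo", 1: "up (wedge)", 4: "either (wavy)", 6: "down (hash)"}
DOUBLE_BOND_STEREO = {0: "cis/trans from coordinates", 3: "either (crossed)"}


def _int_field(line, start, end):
    """Read a fixed-width integer field; blank or missing fields count as 0."""
    text = line[start:end].strip()
    return int(text) if text else 0


def read_bond_stereo(molblock):
    """Return (chiral_flag, [(atom1, atom2, order, stereo_label), ...])."""
    lines = molblock.splitlines()
    counts = lines[3]                          # line 4 is the counts line
    n_atoms = _int_field(counts, 0, 3)
    n_bonds = _int_field(counts, 3, 6)
    chiral = _int_field(counts, 12, 15) == 1   # chiral flag: 1 = absolute
    first_bond = 4 + n_atoms                   # bond block follows the atom block
    bonds = []
    for line in lines[first_bond:first_bond + n_bonds]:
        atom1 = _int_field(line, 0, 3)
        atom2 = _int_field(line, 3, 6)
        order = _int_field(line, 6, 9)
        stereo = _int_field(line, 9, 12)
        table = DOUBLE_BOND_STEREO if order == 2 else SINGLE_BOND_STEREO
        bonds.append((atom1, atom2, order, table.get(stereo, "unrecognized code")))
    return chiral, bonds
```

Note that for double bonds the stereo field only distinguishes "take cis/trans from the coordinates" (0) from "either" (3), which is the limitation discussed further down.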

Great. Now let’s see what possibilities we have for representing double bond stereochemistry (for clarity, I skipped some linear representations only applicable when the double bond has exactly two substituents, one on each end):

From the picture above one thing is clear: some representations are missing. From the discussion on the Blue Obelisk Exchange forum, people (at least those who commented on this question) seem to prefer the crossed double bond representation rather than introducing a wavy bond next to the double bond in question. OK. So, let’s forget about the wavy bonds. But have we decided what the crossed double bond will be used for? Egon Willighagen correctly pointed out that we should distinguish between unknown and undefined stereochemistry. So let’s first decide what we mean by these two words. Here is my attempt:

Unknown double bond stereo: configuration can be cis OR trans. We don’t know which one, but it is certainly one of them, and the sample is not a mixture of the two.

Undefined double bond stereo: we don’t know anything about the configuration. Can be cis, can be trans, can be a mixture of the two.

Now, which of the two would be best represented by a crossed double bond? I honestly don’t know, but it seems that the crossed double bond is associated with the unknown case (at least ChemWriter and MarvinSketch use it that way – they both write out the “either” bond stereo type for a crossed double bond in MOL files).

For undefined double bond stereo, Marvin introduced another bond type (see picture on the left). Another problem arising from this new bond type is how to store it e.g. in v2000 MOL files. Since the bond stereo type for double bonds is limited to the absolute and unknown cases, there is no way to directly indicate this in the bond block without violating the MOL file spec. Marvin has a clever solution for this by storing this information as an extra “M” line in the properties block:

M MRV CTU 1 1 (number of undefined double bonds in the connection table followed by their identifiers).
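As a rough sketch, such a line could be picked up as follows. This relies only on the layout described above (a count followed by that many identifiers) and assumes whitespace-separated tokens; the exact column layout Marvin writes may differ, and `parse_marvin_ctu` is a name made up for this example.

```python
def parse_marvin_ctu(line):
    """Return the identifiers listed on an 'M MRV CTU' line, or None for other lines.

    Assumes the layout described in the post: a count, then that many identifiers.
    """
    tokens = line.split()
    if tokens[:3] != ["M", "MRV", "CTU"]:
        return None
    count = int(tokens[3])
    identifiers = [int(token) for token in tokens[4:4 + count]]
    if len(identifiers) != count:
        raise ValueError("line announces %d identifiers but lists %d" % (count, len(identifiers)))
    return identifiers
```

A reader that does not know about this extension would simply treat the double bond as having ordinary coordinate-derived stereo, which is why the information is easy to lose between tools.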

Unfortunately, we are not done yet… what about a sample containing a mixture of cis AND trans? I’m not aware of any representation for this case, except for drawing both isomers with additional explanatory text. This is exactly what IUPAC suggests for mixtures, but I don’t think many cheminformaticians will like this idea… neither do I.

In principle, such a bond type for mixtures (cis AND trans) could be added like the undefined bond type in Marvin without any problems.

Another suggestion, which came from the developers of Ketcher, was to apply a system of labels on plain double bonds. That way the unknown, undefined and mixture cases could be represented by using “OR”, “?” and “AND” labels, respectively. I personally like this idea as it goes somewhat in the direction of the v3000 MOL format. The only issue I see here is that these labels have to be carefully positioned so that they are not mistaken for labels indicating stereochemistry at tetrahedral stereocentres.
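To spell out how the three labels relate to the definitions given earlier, here is a small sketch that maps each label to the set of sample compositions it allows. The class and function names are invented for this illustration and are not part of Ketcher or any MOL format.

```python
from enum import Enum


class DoubleBondStereo(Enum):
    UNKNOWN = "OR"     # cis or trans, but a single isomer
    UNDEFINED = "?"    # nothing is known about the configuration
    MIXTURE = "AND"    # sample contains both cis and trans


# A sample is described by the set of isomers it contains.
CIS = frozenset({"cis"})
TRANS = frozenset({"trans"})
BOTH = frozenset({"cis", "trans"})

# Which sample compositions each case leaves open.
POSSIBLE_SAMPLES = {
    DoubleBondStereo.UNKNOWN: {CIS, TRANS},
    DoubleBondStereo.UNDEFINED: {CIS, TRANS, BOTH},
    DoubleBondStereo.MIXTURE: {BOTH},
}


def label_for(possible_samples):
    """Return the label whose allowed compositions match the given set exactly."""
    wanted = set(possible_samples)
    for kind, samples in POSSIBLE_SAMPLES.items():
        if samples == wanted:
            return kind.value
    raise ValueError("no label covers exactly these compositions")
```

Written this way, the undefined case is simply the union of the other two, which is one way to see why all three deserve distinct labels.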

I would be very much interested to hear other people's opinions about the above suggestions, and also to hear other possible solutions to the problem.

Sunday, August 14, 2011

Same or different molecules - what do you think?

We've recently done some extensive testing of currently available cheminformatics tools to decide which ones could be integrated into mcule (results of these tests will be published soon on this blog!). One of the simplest tests was reading and writing SD files with a particular tool and checking whether any information was lost or added during conversion. Since the registration system of mcule is primarily based on InChI, we generated InChIs from both the input and output SD files and looked for differences. While we identified a number of tool-related issues, we also found some interesting cases where the input and output molecules were identical per se, but they got different InChIs.
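The comparison itself can be sketched in a few lines. This is only an outline of the idea, not our actual test harness: `roundtrip` stands for "read and write the record with the tool under test" and `to_inchi` for whatever InChI generator is used, and both are placeholders that the caller has to supply.

```python
def find_roundtrip_mismatches(records, roundtrip, to_inchi):
    """Return (index, inchi_before, inchi_after) for records whose InChI changes.

    records   -- iterable of molecule records (e.g. SD file entries as strings)
    roundtrip -- placeholder: reads and rewrites one record with the tool under test
    to_inchi  -- placeholder: generates an InChI string from one record
    """
    mismatches = []
    for index, record in enumerate(records):
        before = to_inchi(record)
        after = to_inchi(roundtrip(record))
        if before != after:
            mismatches.append((index, before, after))
    return mismatches
```

Every reported mismatch still needs a manual look, because a changed InChI can come either from the tool altering the record or from two equivalent drawings that InChI does not treat as the same.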

Five most common issues with molecular database registration systems. Part 2: Isomer detection

After breaking down multiple component entries to single components, the next step is to analyze these components in more detail and identify different isoforms/representations of the same molecule to make sure the same compounds get the same ID, and different compounds get different IDs. This requires the detection of isomers.
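The "same compound, same ID" rule can be sketched as a registry keyed on a normalized form of each structure. This is an illustration under assumptions, not the mcule registration code: `normalize` is a placeholder for whatever reduces the isomers you want to treat as identical to one key, and the `ID-n` format is made up.

```python
class Registry:
    """Assign one ID per normalized structure key."""

    def __init__(self, normalize):
        self._normalize = normalize   # placeholder: maps a structure to its key
        self._ids = {}

    def register(self, structure):
        """Return the existing ID for this structure's key, or create a new one."""
        key = self._normalize(structure)
        if key not in self._ids:
            self._ids[key] = "ID-%d" % (len(self._ids) + 1)
        return self._ids[key]
```

All of the chemistry lives in `normalize`: the stricter it is, the more representations end up with different IDs, so choosing it is the same decision as choosing what counts as "the same".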

Part 2: Isomer detection

The required level of handling isomers depends on the purpose of the database, and it is worth deciding which level is the best in your case. In short, you have to decide what is the same and what is different. In mcule, we are building a vendor database comprising many millions of molecules. These molecules were drawn by different individuals following different conventions and using different molecule sketchers, inevitably resulting in different representations of the same molecule. For our database, we have to identify different isoforms of the same molecule and assign them the same ID. We consider tautomers, different protonation states and mesomers the same at a particular stage of the registration process. At this level you have to identify that the structures below are identical:

Five most common issues with molecular database registration systems. Part 1: Multiple components

Developing a molecule database is not a trivial task. There are several problems you have to deal with and they can be handled differently depending on the size, content and purpose of the database. Here I try to provide a list of the most common issues we came across recently. First: multiple components.

Part 1. Multiple components.

It usually makes sense to handle single components separately even if samples are provided as multiple components in the source data format. This enables structure searching/filtering at the level of single components. Also, you always have to be suspicious of multiple component entries. The sample might be a mixture of individual, equally important components, but it is also possible that the components are in a particular relationship. It is crucial to analyze this relationship before registering any of the components. Entries coming from vendor companies might contain the following: salt counter ions, solvent molecules, additives (e.g. antioxidants), contaminants, intermediates and different stereoisomers. You can see some interesting examples here:
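Independently of those examples, the first step of breaking an entry into components and flagging the obvious ones can be sketched roughly like this. The sketch works on SMILES strings, where `.` separates disconnected components, and the two lookup sets are tiny, made-up samples rather than a real curated list. Matching by plain string comparison is naive, since the same ion can be written in several ways, so a real system would compare canonical forms.

```python
# Illustrative lookup tables only; a real list would be much longer and canonicalized.
KNOWN_COUNTERIONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl"}
KNOWN_SOLVENTS = {"O", "CO", "CCO"}


def split_components(smiles):
    """Split a SMILES string on '.', the separator for disconnected components."""
    return [part for part in smiles.split(".") if part]


def classify_components(smiles):
    """Return (component, role) pairs; anything unrecognized is left for review."""
    result = []
    for part in split_components(smiles):
        if part in KNOWN_COUNTERIONS:
            role = "counter ion"
        elif part in KNOWN_SOLVENTS:
            role = "solvent"
        else:
            role = "needs review"
        result.append((part, role))
    return result
```

For example, `classify_components("CC(=O)[O-].[Na+].O")` marks the sodium ion as a counter ion and the water as a solvent, and leaves the acetate for review. Deciding how the remaining components relate to each other is the part that cannot be automated this simply.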

the road to mcule

As a first post on the mcule blog, we thought it was a good idea to explain our motivations and how the original idea of mcule came about. It all started with the observation of an unmet need. Well, it was actually an unmet need of the storyteller :)

I started my PhD in 2004. As a subject, my supervisor suggested the histamine H4 receptor – a novel and very interesting drug target. Only a few selective H4 ligands were known at that time, so any new ligands would have been of great value. We had good experiences with histamine receptor homology models in the past, so we decided to build a structural model for the H4 receptor and screen compounds virtually in the hope of finding new H4 ligands. After selecting one suitable model for screening, we faced the first major problem: how could we screen all commercially available compounds? While the ZINC database offered screening libraries of a few vendor companies, it only represented a small portion of the purchasable chemical space and most of the libraries were out of date. Still, ZINC was a great help and served as a starting point for our screening database. We updated some libraries directly from the vendors and also added some new libraries not included in ZINC. This took, however, a long time (seeking vendors, registering on their websites, waiting for passwords, downloading libraries or waiting for their CDs to arrive, etc.). Preparation of the newly added compounds (prediction of protonation states, 3D coordinate generation, ID generation, etc.) also took a while, but after a few more weeks the database was ready for screening.
