Silverlight PivotViewer

I had five minutes to spare, so I took a look at the new Silverlight PivotViewer control with the ACD collection that I created a couple of blog posts ago.

[Image: the PivotViewer control displaying the ACD collection]

As I am at the limit of what I can do with a static collection, the next step is to move to a dynamic data collection based on a molecule structure or name search entered by the user.

Retrieving CML from Word Documents

In addition to mining information from some XML files, I needed to pick out structure information from a set of Word documents. I have been playing around with Chem4Word and, although I have to admit I find it a little flaky and it does make my Word 2007 application slow to start, Chem4Word has some interesting ideas in terms of its user interface. I also like the idea of augmenting Office documents with chemically (and biologically) aware data tags and metadata.

You do have to wonder why it isn't Chem4Office: as much as scientists like Word, I find that they love Excel (for its pivoting and analysis capabilities) and PowerPoint (for those all-important presentations to one's peers) much more. Presentation-quality graphics in PowerPoint have become a major requirement over the last couple of years.

Rather than use the Office Interop assemblies, I decided to look at the OpenXml SDK libraries, which allow you to parse and examine (and, if you so wish, make changes to) the contents of Office documents. Using the OpenXml libraries to open the document and retrieve the XML information proved very simple.


// open the Word document
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(this.WordDocumentPath, false))
{
    // get the main part of the document
    MainDocumentPart mainPart = wordDocument.MainDocumentPart;

    // the CML information is stored as a CustomXmlPart
    IEnumerable<CustomXmlPart> enumerableCustomParts = mainPart.GetPartsOfType<CustomXmlPart>();

    // read the contents of each custom XML part and determine whether it is CML
    foreach (CustomXmlPart customXmlPart in enumerableCustomParts)
    {
        using (StreamReader reader = new StreamReader(customXmlPart.GetStream()))
        {
            string xml = reader.ReadToEnd();
            // check whether the root element is cml
            if (xml.StartsWith("<cml"))
            {
                // it is CML; parse it and pull the names from the file
                XDocument document = XDocument.Parse(xml);

                // .....
            }
        }
    }
}

Once I had the names, I used the Symyx Direct “name to structure” feature to generate a structure I could register in my database. I thought about writing a CML-to-Molfile converter but, as I only have a small number of documents and the structures are fairly simple, the name-to-structure conversion worked for what I needed.
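The elided step in the C# snippet above boils down to reading the chemical names out of the parsed CML. Since CML records names in `<name>` elements, the idea can be sketched in Java as a small DOM walk (the `CmlNames` helper class here is hypothetical, written just for illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CmlNames {
    // Pulls the text of every <name> element out of a CML fragment.
    public static List<String> namesFrom(String cml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // CML documents are namespaced; matching on the local name with a
        // wildcard namespace keeps the sketch tolerant of prefixes
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(cml)));
        NodeList nodes = doc.getElementsByTagNameNS("*", "name");
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent().trim());
        }
        return names;
    }
}
```

Each name returned can then be fed to the name-to-structure conversion described above.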

Chemical Structure searches with Lucene

The other day I was trying to search some XML documents that included embedded chemical structures. I also wanted to search other data associated with the XML documents, which included large sections of textual data. Given that these documents also represented test sets that I was using for some of my unit tests, it was a safe bet that I would need to search them several times for different information, so I decided to write a small set of utilities to find the information I wanted.

I thought about loading the documents into a database but decided that this was far too much overhead for what seemed like a simple task. As I have also been thinking about indexers and document crawlers recently, I looked at enabling the Lucene search engine library to structure-search the documents.

First I created a wrapper class that retrieved the information I wanted to index from the XML document. The file contents are loaded into a DOM Document object and XPath expressions are then used to pull out the information; in getMolfile below, the doc object holds the parsed XML from the file.

public class FileParser {
    private static final String XPATH_MOLFILE = "//Molecule";

    private String xml;
    private Document doc;
    private XPathFactory xPathfactory;
    private boolean init = false;

    /**
     * Constructor
     */
    public FileParser() {}

    /**
     * Loads the file contents.
     * @param file the file to load
     * @throws IOException
     * @throws SAXException
     * @throws ParserConfigurationException
     */
    public void LoadFile(File file) throws IOException, SAXException, ParserConfigurationException {
        xml = readFileContents(file);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(new InputSource(new StringReader(xml)));
        xPathfactory = XPathFactory.newInstance();
        init = true;
    }

    /**
     * Returns the molfile from the document.
     * @return the molfile
     * @throws Exception
     */
    public String getMolfile() throws Exception {
        String ret = null;
        if (init) {
            XPath xpath = xPathfactory.newXPath();
            XPathExpression expr = xpath.compile(XPATH_MOLFILE);
            Object result = expr.evaluate(doc, XPathConstants.STRING);
            ret = result.toString();
        } else {
            throw new Exception("Not init");
        }
        return ret;
    }

    // getID() and readFileContents() elided

Once I had abstracted the code to return the content from the XML file, I implemented code to generate the Lucene document containing the information I wanted indexed.

public static Document Document(File f) throws java.io.FileNotFoundException {
     // make a new, empty document
    Document doc = new Document();
    try {
        FileParser parser = new FileParser();
        parser.LoadFile(f);
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("id", parser.getID(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        System.out.println(parser.getID());
        String nemaKey = MoleculeUtils.generateNemaKey(parser.getMolfile()).toLowerCase();
        System.out.println(nemaKey);
        doc.add(new Field("nemakey", nemaKey, Field.Store.YES, Field.Index.NOT_ANALYZED));
        String sssKeys = MoleculeUtils.generateSSSKey(parser.getMolfile());
        System.out.println(sssKeys);
        doc.add(new Field("ssskey", sssKeys, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("modified",
             DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
             Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(f)));
    }
    catch(Exception e)
    {
        e.printStackTrace();
    }
    // return the document
    return doc;
}

To enable exact and substructure searching I needed to generate structure keys that I could index. To generate these keys I used Cheshire: a NEMAKey for exact searching and SSS keys for substructure searching.

public class MoleculeUtils {
    public static String generateNemaKey(String molFile) throws UnsatisfiedLinkError, CheshireException {
        String ret = null;
        Cheshire cheshire = new Cheshire();
        cheshire.setTargetChemicalObject(molFile);
        if (cheshire.runScript("List(M_NEMAKEYCOMPLETE)")) {
            ret = cheshire.getScriptResult();
        }
        return ret;
    }

    public static String generateSSSKey(String molFile) throws UnsatisfiedLinkError, CheshireException {
        String ret = null;
        Cheshire cheshire = new Cheshire();
        cheshire.setTargetChemicalObject(molFile);
        if (cheshire.runScript("SSKeys(SSKEYS_2DSUBSET, SSKEYS_INDEX, ' ')")) {
            ret = cheshire.getScriptResult();
        }
        return ret;
    }
}

Once the structures within the files were converted and indexed, I wrote a small command line utility that let me search them. When I enter a structure, the utility converts it into the same key format the indexer generated and then lets Lucene handle the string comparison.
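Stripped of Lucene, the exact-match principle is simply that the query structure goes through the same key generator as the indexed structures, reducing structure comparison to string comparison. A minimal sketch of that idea, where `canonicalKey` is a toy stand-in for Cheshire's NEMAKey generation (not the real algorithm):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExactSearchSketch {
    // key -> files whose indexed structure produced that key
    private final Map<String, List<String>> index = new HashMap<String, List<String>>();

    // toy stand-in for NEMAKey generation: any canonical, input-invariant
    // string works, as long as indexer and searcher use the same function
    static String canonicalKey(String structure) {
        char[] chars = structure.toLowerCase().replaceAll("\\s", "").toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public void add(String path, String structure) {
        String key = canonicalKey(structure);
        index.computeIfAbsent(key, k -> new ArrayList<String>()).add(path);
    }

    public List<String> search(String queryStructure) {
        // the query goes through the same key generator as the index,
        // so matching is a plain string lookup
        return index.getOrDefault(canonicalKey(queryStructure), new ArrayList<String>());
    }
}
```

In the real utility the lookup is a Lucene term query against the "nemakey" field, and the substructure case works the same way except that the analyzed "ssskey" field lets Lucene match on individual SSS key tokens.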

Even for a relatively small data set, the time saved in being able to find the test sets I wanted has more than made up for the time taken to write the tools.

You could imagine the same approach working for crawling large document datasets: generating indexes split across multiple machines, then searching across those machines to produce a full result set of matching documents. Perhaps also a simple front end that lets the user enter structures via a drawing tool (sounds like the start of a series of blog posts).

Pivot

After taking a look at some of the online demos for Microsoft Live Labs Pivot and reading through some of the documentation, I decided to see whether I could load data (about 3,000 molecules) from our Available Chemical Database (ACD) into the Pivot client.

Here is an image of the final collection.

[Image: the ACD collection in Pivot]

The first thing I needed in order to generate the collection was images for the chemical structures. For this I used the headless renderer control that comes with Symyx Draw. After creating a simple application that crawled the database and wrote the resulting images to the file system, I hit a snag: creating the collections with the Excel spreadsheet tool proved much too slow.

Looking through some web articles I came across the Deep Zoom Tools. I then created a small application using the Deep Zoom Tools that loaded the structure images and generated the collection files.

Once this had completed, I wrote another application to generate the .cxml file from the data within the database (as much as I love Xml Spy, generating XML by hand for all but the smallest of files seems such a waste of time).
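A .cxml file is just the collection metadata: facet categories up front, then one Item per molecule pointing at its Deep Zoom image and carrying its facet values. A minimal sketch of such a generator (here in Java with a streaming XML writer; the element and attribute names follow the published Pivot CXML samples, the "Molecular Weight" facet is an assumption for illustration, and the real file also needs the collection schema namespace declarations, omitted here):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class CxmlWriter {
    // writes a minimal single-facet collection; each item row is
    // {id, name, imgRef, molWeight} - check the output against the
    // CXML schema before using it with the Pivot client
    public static String write(String collectionName, String[][] items) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeStartElement("Collection");
        w.writeAttribute("Name", collectionName);
        w.writeAttribute("SchemaVersion", "1.0");
        w.writeStartElement("FacetCategories");
        w.writeEmptyElement("FacetCategory");
        w.writeAttribute("Name", "Molecular Weight");
        w.writeAttribute("Type", "Number");
        w.writeEndElement(); // FacetCategories
        w.writeStartElement("Items");
        for (String[] item : items) {
            w.writeStartElement("Item");
            w.writeAttribute("Id", item[0]);
            w.writeAttribute("Name", item[1]);
            w.writeAttribute("Img", item[2]); // reference into the Deep Zoom collection
            w.writeStartElement("Facets");
            w.writeStartElement("Facet");
            w.writeAttribute("Name", "Molecular Weight");
            w.writeEmptyElement("Number");
            w.writeAttribute("Value", item[3]);
            w.writeEndElement(); // Facet
            w.writeEndElement(); // Facets
            w.writeEndElement(); // Item
        }
        w.writeEndElement(); // Items
        w.writeEndElement(); // Collection
        w.writeEndDocument();
        w.close();
        return out.toString();
    }
}
```

In my version the item rows came straight from a query against the database, one row per registered structure.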

[Image: the collection pivoted by molecular weight]

Given the size of the cxml file generated and the number of images that I generated I was surprised how fast and responsive the client was (even over the WAN).

Although this approach works for small collections, there are several issues with the tools that make generating very large static collections impractical. I would therefore recommend dynamic collections and linked dynamic collections for a production-ready version. I am looking forward to revisiting the Pivot client when the Silverlight version becomes available. Top of my list of things to try is whether I can run a structure search using JDraw (assuming the client has the necessary API) to restrict the initial size of the (dynamic) collection and then link through to other collections.