Converting .doc to .docx programmatically

    July 25, 2007 • .net, c# • 1 Comment

    I’ve been looking for something to help me convert .doc files and other formats to HTML, or to some internal representation that I can parse further on the server side. Along the way I came across some interesting applications such as PurePage and ConvertDoc. These are great, with nice, simple and effective APIs, but unfortunately they cost too much for small-scale projects. The main issue is converting the binary .doc format. Several attempts exist, including Apache POI, but these are more proof of concept than anything.

    So, armed with my copy of Word 2007, I’ve been playing with the COM interface. It looks like we can parse in another way – by extracting the raw XML from the document. When Word 2007 opens a traditional .doc it converts it to its XML-based representation, and we can simply extract this through the COM interface, then close the instance of Word and post-process the XML. Here’s what I did:

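    1. Added a COM reference in Visual Studio to the Microsoft Word 12.0 Object Library.
    2. Wrote a little bit of code:

      // Start an instance of Word through COM interop
      Microsoft.Office.Interop.Word.Application wordApp =
                  new Microsoft.Office.Interop.Word.Application();

      object file = src;     // string containing the location of the .doc file

      object nullobj = System.Reflection.Missing.Value;     // stands in for every optional argument

      Microsoft.Office.Interop.Word.Document doc = wordApp.Documents.Open2002(
                  ref file, ref nullobj, ref nullobj,
                  ref nullobj, ref nullobj, ref nullobj,
                  ref nullobj, ref nullobj, ref nullobj,
                  ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj, ref nullobj);

      doc.ActiveWindow.Selection.WholeStory();     // select the entire story

      string xml = doc.ActiveWindow.Selection.get_XML(false);  // XML for the selection (false = keep the Word markup, not data only)

    3. ‘xml’ now contains the XML representation of the .doc file. One caveat: as far as I can tell, get_XML returns the Word 2003 flavour of the XML (WordprocessingML) rather than the Open XML used inside a .docx; the Word 2007 object model also has a WordOpenXML property on Selection and Range if that flavour is what you need. Don’t forget to shut the instance of Word down afterwards with doc.Close(ref nullobj, ref nullobj, ref nullobj); followed by wordApp.Quit(ref nullobj, ref nullobj, ref nullobj);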

    Now I just need to find an ODF C# library that can help me make sense of the resulting XML string!
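
    Until such a library turns up, the string can at least be inspected with LINQ to XML. The following is only a minimal sketch and not a full converter: the class and method names (WordXmlText, ExtractParagraphs) are made up for illustration, and it assumes the string is WordprocessingML whose paragraphs are w:p elements and whose text runs are w:t elements. Depending on which Word property produced the string, those elements may live in the Word 2003 namespace or in the Open XML one, so the sketch accepts both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the Word object model.
public static class WordXmlText
{
    // The two WordprocessingML namespaces this sketch recognises:
    // the Word 2003 XML format and the Open XML (.docx) format.
    private static readonly HashSet<string> WordNamespaces = new HashSet<string>
    {
        "http://schemas.microsoft.com/office/word/2003/wordml",
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    };

    // True when the element is a WordprocessingML element with the given local name.
    private static bool IsWordElement(XElement element, string localName)
    {
        return element.Name.LocalName == localName
            && WordNamespaces.Contains(element.Name.NamespaceName);
    }

    // Returns the plain text of every paragraph (w:p) in document order,
    // built by concatenating the text runs (w:t) inside that paragraph.
    // Formatting, tables and images are ignored.
    public static List<string> ExtractParagraphs(string xml)
    {
        XDocument document = XDocument.Parse(xml);

        return document
            .Descendants()
            .Where(e => IsWordElement(e, "p"))
            .Select(p => string.Concat(
                p.Descendants()
                 .Where(e => IsWordElement(e, "t"))
                 .Select(t => t.Value)))
            .ToList();
    }
}
```

    Calling WordXmlText.ExtractParagraphs(xml) on the string from the steps above should give one entry per paragraph. Matching on namespace as well as element name avoids picking up unrelated elements that happen to be called p or t, but paragraphs nested inside text boxes would still be counted both on their own and as part of their parent, so treat the output as a starting point for further parsing.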

    About

    .NET developer at thetrainline.com, previously web developer at MRM Meteorite. Awarded a PhD in misbehaviour detection in wireless ad-hoc networks. A keen C# ASP.NET developer bridging the gap with APIs and JavaScript frameworks, one web app at a time.

    http://www.paulkiddie.com

    One Response to Converting .doc to .docx programmatically

    1. Razvan
      February 15, 2011 at 7:18 am

      Ok, so did you find the ODF C# library needed? I have a similar task, I want to take the XML string, pass it to a server and the server should be able to save it as .docx. No luck yet for me.
