Git (or Hg) plugin for dealing with Microsoft Word and/or OpenOffice files

21,898

Solution 1

How about:

  1. Save your Word docs in XML.
  2. Commit your XML Word files.
  3. Diff using an external XML diff tool. For example:

    $ git difftool -t xmldiff c3d293 498571

Transforming the XML files to have one element per line should make the check-in process run efficiently and also allow the external XML diff tool to process quickly.
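
A minimal sketch of the one-element-per-line transformation mentioned above, assuming the XML arrives as one long line. The function name `split_elements` is made up for this example, and the `sed` substitution is deliberately crude: it only breaks the line between adjacent tags and knows nothing about CDATA or comments. A real formatter such as `xmllint --format` is probably a safer choice where it is available.

```shell
# Hypothetical helper: put each tag boundary on its own line so that
# line-based tools (git, diff) see small, local changes.
split_elements() {
  # Replace every "><" with ">", a newline, and "<".
  sed 's/></>\
</g'
}

# Example input: a tiny XML document written on a single line.
printf '<a><b>text</b><c/></a>' | split_elements
# prints:
# <a>
# <b>text</b>
# <c/>
# </a>
```

The same function could be run over a file before committing, for example `split_elements < document.xml > document.lines.xml` (both file names are placeholders).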

Solution 2

A nice trick I was able to come up with that also works on Open Office files, PPTs, etc.:

http://xcafebabe.blogspot.hu/2012/09/sexy-comparison-of-word-documents-with.html

Here's a screenshot that demonstrates the result:

[Screenshot of the Word document comparison]

Solution 3

If you are on MS Windows, use TortoiseGit. I just had to go through this painful experience, and TGit, although inelegant, takes some of the pain out of it. A couple of other points:

  • Surprisingly, git diff and gitk both do a reasonably good job of at least visualizing diffs between .docx files (not sure about .doc, but I would assume it's the same). This is good for a quick scan of diffs when doing commits.
  • You are completely out of luck as far as fast-forwarding and automerging are concerned. Unfortunately, I have not found a tool that can handle this (although I like the XML idea above), so you will have to do all merges manually.
  • Microsoft Word (MS Word) has a decent, if flawed, merge tool. AFAIK, it can only do 2-way merges (i.e. X0 + dX = X1), not the 3-way or 2-parent merges that are more common in version control (i.e. X0 + dX1 + dX2 = X1). You could resolve merge conflicts using this tool, but there would be some legwork first: checking out each branch, exporting HEAD as an untracked version, etc., to produce:

    X0 = *.BASE.docx,
    X0 + dX1 = *.LOCAL.docx and
    X0 + dX2 = *.REMOTE.docx
    
  • Luckily, this is exactly what TGit (and TSVN too) does. Unfortunately, I would avoid rebase, since replaying several changes in a row can be very tiring, but merging short documents is fine, just not great.
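
For anyone without TGit, the BASE/LOCAL/REMOTE files described above can be pulled out of git's index by hand while a merge is conflicted. This is only a sketch: the function name `export_merge_inputs` is made up for this example, and it assumes you run it from the repository root on a file that has an extension. It relies on git's index stages, where stage 1 is the common ancestor, stage 2 is your side and stage 3 is the side being merged in.

```shell
# Hypothetical helper: write the three inputs of a conflicted merge next to
# the document, using the *.BASE / *.LOCAL / *.REMOTE naming from above.
export_merge_inputs() {
  file=$1            # conflicted path relative to the repo root, e.g. report.docx
  stem=${file%.*}    # path without its extension
  ext=${file##*.}    # extension without the dot

  git show ":1:$file" > "$stem.BASE.$ext"     # X0: common ancestor
  git show ":2:$file" > "$stem.LOCAL.$ext"    # X0 + dX1: our branch
  git show ":3:$file" > "$stem.REMOTE.$ext"   # X0 + dX2: branch being merged
}
```

After a failed `git merge`, running `export_merge_inputs report.docx` (the file name is a placeholder) should leave three untracked copies that can be opened in Word's compare/merge tool.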

Solution 4

Answering JudoWill's question: Workshare is probably the leading tool used by lawyers.

Solution 5

I compiled instructions from multiple places here: http://bit.ly/17LaxVY

# download docx2txt by Sandeep Kumar
wget -O docx2txt.pl http://www.cs.indiana.edu/~kinzler/home/binp/docx2txt

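# make a wrapper (the file name is quoted so paths with spaces work)
echo '#!/bin/bash
docx2txt.pl "$1" -' > docx2txt
# both the wrapper and the downloaded script must be executable
chmod +x docx2txt docx2txt.pl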

# make sure docx2txt.pl and docx2txt are on your PATH. Here's a guide:
# http://shapeshed.com/using_custom_shell_scripts_on_osx_or_linux/
mv docx2txt docx2txt.pl ~/bin/

# set git attributes; run this inside the repository, after `git init`
# (unfortunately I don't think this can be set by default; you have to create it for every project)
echo "*.docx diff=word" > .git/info/attributes

# add the following to ~/.gitconfig
[diff "word"]
    binary = true
    textconv = docx2txt

# add a new alias
[alias]
    wdiff = diff --color-words

# try it
git init

# create my_file.docx, add some content

git add my_file.docx

git commit -m "Initial commit"

# change something in my_file.docx

git wdiff my_file.docx

# awesome!

It works great on OSX
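
The same settings can also be applied with `git config` instead of editing files by hand. The following is a sketch under a few assumptions: it works in a throwaway repository created with `mktemp`, it scopes the settings to that repository rather than `~/.gitconfig`, and it assumes a `docx2txt` wrapper like the one above is on your PATH when you actually run a diff.

```shell
# Scratch repository used only for this illustration.
repo=$(mktemp -d)
git init -q "$repo"

# Equivalent of the [diff "word"] and [alias] sections, scoped to this repo.
git -C "$repo" config diff.word.binary true
git -C "$repo" config diff.word.textconv docx2txt
git -C "$repo" config alias.wdiff "diff --color-words"

# Route *.docx files through the "word" diff driver for this clone only.
mkdir -p "$repo/.git/info"
echo "*.docx diff=word" > "$repo/.git/info/attributes"

# Ask git which diff driver it would pick for a .docx path.
git -C "$repo" check-attr diff -- my_file.docx
# prints: my_file.docx: diff: word
```

`git check-attr` is a quick way to confirm the attribute is picked up before blaming the converter. Git also documents a `core.attributesFile` setting for a per-user attributes file, which may be worth a look if repeating the attributes step in every project becomes tedious.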

Author by

JudoWill

A python, matlab and django programmer in a bioinformatics PhD program

Updated on June 08, 2020

Comments

  • JudoWill
    JudoWill almost 4 years

    Has anyone come across a Git or Hg plugin for "meaningful" diffs/merging/branching of OpenOffice or Microsoft Word files?

    I know I can 'checkin' .doc files but both Git and Hg treat them as binary blobs. I'd like to be able to do all (or at least many) of the normal revision based operations on the text of the file.

    And yes, I do know that I should be using LaTeX or converting files back and forth to RTF. I'm just looking for a more "native" solution since I'm trying to manage collaboration between techies and "management people".

    This is related to my question on Biostar here: http://biostar.stackexchange.com/questions/1749/writing-collaboration-with-source-control-and-microsoft-word

    Thanks.

  • JudoWill
    JudoWill almost 14 years
    I don't expect them to use git or hg ... I expect them to use Word (or something like it), and then I was hoping to use the plugin to facilitate the merging. I'm in an academic institution, so I doubt I would be able to afford a custom solution. Out of curiosity, though, do you have names or links to the "Law Firm" systems?
  • Mark Mikofski
    Mark Mikofski over 11 years
    +1 for screenshot. This is exactly what TGit does! This is what I was talking about in my comment above, but you only have to create a new diff/merge tool if you want to be able to call it directly from git or if you don't have tortoiseXXX. What do Mac folk do? If you do have TGit, then just use your Explorer extensions to diff, merge, etc. Note that if you use git merge/rebase, it will still fail, and you will still have to merge the Word docs manually, which was kind of the original goal. Still looking. NB: XML didn't work.
  • rlegendi
    rlegendi over 11 years
    Cool, thx for the clarification! Actually, I just wanted to install TGit :-)
  • Rich
    Rich almost 11 years
    IMO this is the best answer - the linked blog post lets you use TGit's word diffing script without needing to install TGit (which interferes with Cygwin's git by installing msysgit)
  • eric
    eric almost 10 years
    Can that handle revert?
  • Honza Kuchař
    Honza Kuchař almost 8 years
    And the best thing is that Word can also MERGE documents!
  • TamaMcGlinn
    TamaMcGlinn almost 3 years
    To save some people a lot of clicks; it is $300 per year, free trial available, and compares pdf as well as word, and does OCR on (embedded) images.