Generating and applying diffs in python

19,796

Solution 1

Did you have a look at diff-match-patch from google? Apparantly google Docs uses this set of algoritms. It includes not only a diff module, but also a patch module, so you can generate the newest file from older files and diffs.

A python version is included.

http://code.google.com/p/google-diff-match-patch/

Solution 2

Does difflib.unified_diff do want you want? There is an example here.

Solution 3

I've implemented a pure python function to apply diff patches to recover either of the input strings, I hope someone finds it useful. It uses parses the Unified diff format.

import re

_hdr_pat = re.compile("^@@ -(\d+),?(\d+)? \+(\d+),?(\d+)? @@$")

def apply_patch(s,patch,revert=False):
  """
  Apply unified diff patch to string s to recover newer string.
  If revert is True, treat s as the newer string, recover older string.
  """
  s = s.splitlines(True)
  p = patch.splitlines(True)
  t = ''
  i = sl = 0
  (midx,sign) = (1,'+') if not revert else (3,'-')
  while i < len(p) and p[i].startswith(("---","+++")): i += 1 # skip header lines
  while i < len(p):
    m = _hdr_pat.match(p[i])
    if not m: raise Exception("Cannot process diff")
    i += 1
    l = int(m.group(midx))-1 + (m.group(midx+1) == '0')
    t += ''.join(s[sl:l])
    sl = l
    while i < len(p) and p[i][0] != '@':
      if i+1 < len(p) and p[i+1][0] == '\\': line = p[i][:-1]; i += 2
      else: line = p[i]; i += 1
      if len(line) > 0:
        if line[0] == sign or line[0] == ' ': t += line[1:]
        sl += (line[0] != sign)
  t += ''.join(s[sl:])
  return t

If there are header lines ("--- ...\n","+++ ...\n") it skips over them. If we have a unified diff string diffstr representing the diff between oldstr and newstr:

# recreate `newstr` from `oldstr`+patch
newstr = apply_patch(oldstr, diffstr)
# recreate `oldstr` from `newstr`+patch
oldstr = apply_patch(newstr, diffstr, True)

In Python you can generate a unified diff of two strings using difflib (part of the standard library):

import difflib
_no_eol = "\ No newline at end of file"

def make_patch(a,b):
  """
  Get unified string diff between two strings. Trims top two lines.
  Returns empty string if strings are identical.
  """
  diffs = difflib.unified_diff(a.splitlines(True),b.splitlines(True),n=0)
  try: _,_ = next(diffs),next(diffs)
  except StopIteration: pass
  return ''.join([d if d[-1] == '\n' else d+'\n'+_no_eol+'\n' for d in diffs])

On unix: diff -U0 a.txt b.txt

Code is on GitHub here along with tests using ASCII and random unicode characters: https://gist.github.com/noporpoise/16e731849eb1231e86d78f9dfeca3abc

Solution 4

AFAIK most diff algorithms use a simple Longest Common Subsequence match, to find the common part between two texts and whatever is left is considered the difference. It shouldn't be too difficult to code up your own dynamic programming algorithm to accomplish that in python, the wikipedia page above provides the algorithm too.

Share:
19,796
noio
Author by

noio

Game designer &amp; developer.

Updated on June 02, 2022

Comments

  • noio
    noio about 2 years

    Is there an 'out-of-the-box' way in python to generate a list of differences between two texts, and then applying this diff to one file to obtain the other, later?

    I want to keep the revision history of a text, but I don't want to save the entire text for each revision if there is just a single edited line. I looked at difflib, but I couldn't see how to generate a list of just the edited lines that can still be used to modify one text to obtain the other.

  • noio
    noio over 14 years
    It would have to be a pure python solution because I'd like to deploy it in AppEngine. diff/patch would be ideal, but then in python.
  • noio
    noio over 14 years
    Exactly what I was looking for! I tried googling for different combinations of "python","diff","patch","revision", but hadn't found this yet.
  • NealWalters
    NealWalters over 13 years
    Voted your answer up. The built in difflib seems powerful, yet somewhat confusing, just a matter of getting over the learning curve. See my similar post here: stackoverflow.com/questions/4743359/…
  • Paragon
    Paragon about 12 years
    google-diff-match-patch does seem store the entire file. It saves everything in tuples: (0, 'stuff') represents that 'stuff' is present in both strings. The system is simple enough that it stores literally every character so that it can iterate through them and modify the text as need be.
  • qre0ct
    qre0ct about 11 years
    how can i use this API with Python>? Would be great if it could be illustrated with an example
  • Admin
    Admin over 9 years
    it still stores the entire text in form of tuples mixed with deleted and inserted text.. too verbose and bulky when you have to fit that in ram...
  • Fake Name
    Fake Name over 8 years
    The library has no way to apply the output of difflib.unified_diff. It has diff, but no patch. As such, if you're trying to stay within python, ` difflib.unified_diff ` is useless.
  • Pithikos
    Pithikos over 7 years
    Notice that this kind of computations are generally slow so probably sticking with something lower level would scale better!