docx4j find and replace

16,632

Solution 1

You can use VariableReplace to achieve this; it may not have existed when the other answers were written. It does not do a find/replace per se, but works on placeholders such as ${myField}.

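// import org.docx4j.model.datastorage.migration.VariablePrepare;
java.util.HashMap<String, String> mappings = new java.util.HashMap<>();
VariablePrepare.prepare(wordMLPackage); // tidies up runs so that split placeholders are joined; see notes
mappings.put("myField", "foo");
wordMLPackage.getMainDocumentPart().variableReplace(mappings);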

Note that you do not pass ${myField} as the field name; pass the bare field name myField instead. This is rather inflexible: as it currently stands, your placeholders must have the format ${xyz}, whereas if you could pass in anything, you could use it for any find/replace. The same capability exists for C# users in docx4j.NET.

For more information, see docx4j's VariableReplace sample and its VariablePrepare class.

Solution 2

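Good day. Here is an example of how to quickly replace text with whatever you need by using a regexp. It finds placeholders such as ${param.sumname} and replaces them in the document. Note that you have to insert the placeholder as 'text only', so that it is not split across several runs. Have fun!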

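  import java.io.File;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import javax.xml.bind.JAXBElement; // jakarta.xml.bind.JAXBElement on docx4j builds that use Jakarta XML Binding
  import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
  import org.docx4j.wml.ContentAccessor;

  WordprocessingMLPackage mlp = WordprocessingMLPackage.load(new File("filepath"));
  replaceText(mlp.getMainDocumentPart());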

  static void replaceText(ContentAccessor c)
    throws Exception
  {
    for (Object p: c.getContent())
    {
      if (p instanceof ContentAccessor)
        replaceText((ContentAccessor) p);

      else if (p instanceof JAXBElement)
      {
        Object v = ((JAXBElement) p).getValue();

        if (v instanceof ContentAccessor)
          replaceText((ContentAccessor) v);

        else if (v instanceof org.docx4j.wml.Text)
        {
          org.docx4j.wml.Text t = (org.docx4j.wml.Text) v;
          String text = t.getValue();

          if (text != null)
          {
            t.setSpace("preserve"); // needed?
            t.setValue(replaceParams(text));
          }
        }
      }
    }
  }

  static final Pattern PARAM_PATTERN = Pattern.compile("(?i)(\\$\\{([\\w\\.]+)\\})");

  static String replaceParams(String text)
  {
    Matcher m = PARAM_PATTERN.matcher(text);

    if (!m.find())
      return text;

    // appendReplacement requires a StringBuffer before Java 9
    StringBuffer sb = new StringBuffer();

    do
    {
      // group(2) is the name between "${" and "}"; it is part of every match
      String replacement = getParamValue(m.group(2));

      // quoteReplacement keeps '$' and '\' in the value from being read as group references
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    while (m.find());

    m.appendTail(sb);
    return sb.toString();
  }

  static String getParamValue(String name)
  {
    // replace from map or something else
    return name;
  }
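
The getParamValue stub above returns the placeholder name unchanged. As a minimal sketch of the "replace from map" idea, the lookup could be wired in roughly like this. The class name ParamReplacer and the caller-supplied Map are assumptions made for illustration and are not part of the answer or of docx4j. Leaving unknown placeholders untouched is a design choice of this sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j or of the answer above.
final class ParamReplacer {
    // Placeholder shape: ${name}, where name consists of word characters and dots.
    private static final Pattern PARAM = Pattern.compile("\\$\\{([\\w.]+)\\}");

    static String replaceParams(String text, Map<String, String> values) {
        Matcher m = PARAM.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown placeholder: keep the original text, e.g. "${missing}".
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' from being treated as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

In the traversal above, a call such as ParamReplacer.replaceParams(text, values) would take the place of replaceParams(text).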

Solution 3

I created a library to publish my solution because it's quite a lot of code: https://github.com/phip1611/docx4j-search-and-replace-util

The workflow is the following:

First step:

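// getAllElementFromObject is the method from your question. It returns List<Object> there,
// so this line assumes a variant that returns a typed List<Text>.
List<Text> texts = getAllElementFromObject(docxDocument.getMainDocumentPart(), Text.class);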

This way we get all the actual Text content in the correct order, but without the style markup in between. We can edit the Text objects (via setValue) and keep the styles.

Resulting problem: search text/placeholders can be split across multiple Text instances (because the original document can contain style markup in between that is invisible to the reader), e.g. ${FOOBAR} can be stored as ${ + FOOBAR} or as $ + {FOOB + AR}.
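
The effect can be reproduced without docx4j. In the sketch below, plain strings stand in for the values of consecutive Text objects; that substitution and the class name SplitDemo are assumptions made for illustration. Searching each fragment on its own misses the placeholder, while searching the concatenation finds it.

```java
import java.util.List;

// Hypothetical demo class; the strings stand in for Text.getValue() results.
final class SplitDemo {
    // True if any single fragment contains the search text.
    static boolean foundInAnyFragment(List<String> fragments, String search) {
        for (String fragment : fragments) {
            if (fragment.contains(search)) {
                return true;
            }
        }
        return false;
    }

    // True if the fragments, joined in document order, contain the search text.
    static boolean foundInJoinedText(List<String> fragments, String search) {
        return String.join("", fragments).contains(search);
    }
}
```

For the fragments "Dear $", "{FOOB" and "AR}, welcome", the first method reports no match for ${FOOBAR} and the second one reports a match.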

Second step:

Concatenate the values of all Text objects into one full string (the "complete string"):

Optional<String> completeStringOpt = texts.stream().map(Text::getValue).reduce(String::concat);
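
The reduce call above yields an empty Optional for an empty list and would throw if one of the values were null. A possible alternative, sketched here on a plain List<String> under the assumption that such cases should simply contribute nothing, is to join the values with a collector. The class name CompleteString is hypothetical.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper; the strings stand in for Text.getValue() results.
final class CompleteString {
    // Joins all values in order; a null value is treated as an empty string.
    static String join(List<String> values) {
        return values.stream()
                .map(v -> Objects.toString(v, ""))
                .collect(Collectors.joining());
    }
}
```

With docx4j, the input list could be produced by mapping the Text objects through Text::getValue first.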

Third step:

Create a class TextMetaItem. Each TextMetaItem knows, for its Text object, where that object's content begins and ends in the complete string. For example, if the Text objects for "foo" and "bar" result in the complete string "foobar", then indices 0-2 belong to the "foo" Text object and indices 3-5 to the "bar" Text object. Build a List<TextMetaItem>:

static List<TextMetaItem> buildMetaItemList(List<Text> texts) {
    final int[] index = {0};
    final int[] iteration = {0};
    List<TextMetaItem> list = new ArrayList<>();
    texts.forEach(text -> {
        int length = text.getValue().length();
        list.add(new TextMetaItem(index[0], index[0] + length - 1, text, iteration[0]));
        index[0] += length;
        iteration[0]++;
    });
    return list;
}
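
TextMetaItem itself is not shown in this answer. The following is a sketch inferred only from how the class is used here (the four constructor arguments and the getters getEnd, getPosition, getText and getPositionInsideTextObject); the class in the linked library may look different. A small stand-in Text class replaces org.docx4j.wml.Text so that the sketch compiles without docx4j.

```java
// Stand-in for org.docx4j.wml.Text so that the sketch compiles without docx4j.
final class Text {
    private String value;
    Text(String value) { this.value = value; }
    String getValue() { return value; }
    void setValue(String value) { this.value = value; }
}

// Sketch of TextMetaItem, inferred from its usage in this answer.
final class TextMetaItem {
    private final int start;    // index of the first character in the complete string
    private final int end;      // index of the last character in the complete string (inclusive)
    private final Text text;    // the Text object that owns this range
    private final int position; // index of the Text object in the list of all Text objects

    TextMetaItem(int start, int end, Text text, int position) {
        this.start = start;
        this.end = end;
        this.text = text;
        this.position = position;
    }

    int getStart() { return start; }
    int getEnd() { return end; }
    Text getText() { return text; }
    int getPosition() { return position; }

    // Converts an index in the complete string to an index inside this Text object's value.
    int getPositionInsideTextObject(int indexInCompleteString) {
        if (indexInCompleteString < start || indexInCompleteString > end) {
            throw new IndexOutOfBoundsException(
                    "index " + indexInCompleteString + " is outside [" + start + ", " + end + "]");
        }
        return indexInCompleteString - start;
    }
}
```

For the "bar" object of the "foobar" example, index 3 of the complete string maps to index 0 inside the Text object and index 5 maps to index 2.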

Fourth step:

Build a Map<Integer, TextMetaItem> whose key is a character index in the complete string and whose value is the TextMetaItem that owns that character. The map's size therefore equals completeString.length().

static Map<Integer, TextMetaItem> buildStringIndicesToTextMetaItemMap(List<Text> texts) {
    List<TextMetaItem> metaItemList = buildMetaItemList(texts);
    Map<Integer, TextMetaItem> map = new TreeMap<>();
    int currentStringIndicesToTextIndex = 0;
    // + 1 important here! 
    int max = metaItemList.get(metaItemList.size() - 1).getEnd() + 1;
    for (int i = 0; i < max; i++) {
        TextMetaItem currentTextMetaItem = metaItemList.get(currentStringIndicesToTextIndex);
        map.put(i, currentTextMetaItem);
        if (i >= currentTextMetaItem.getEnd()) {
            currentStringIndicesToTextIndex++;
        }
    }
    return map;
}
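
The map above stores one entry per character. A different design, which is not the author's approach and is only sketched here under simplifying assumptions, keys a TreeMap by the start offset of each fragment and looks up the owner of an index with floorEntry. Plain strings stand in for the Text values, the class name OffsetIndex is hypothetical, and empty fragments are skipped because they own no characters.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical alternative to the per-character map; strings stand in for Text values.
final class OffsetIndex {
    // Maps the start offset of each non-empty fragment to its position in the list.
    static TreeMap<Integer, Integer> build(List<String> fragments) {
        TreeMap<Integer, Integer> startToPosition = new TreeMap<>();
        int offset = 0;
        for (int position = 0; position < fragments.size(); position++) {
            String fragment = fragments.get(position);
            if (fragment == null || fragment.isEmpty()) {
                continue; // owns no characters of the complete string
            }
            startToPosition.put(offset, position);
            offset += fragment.length();
        }
        return startToPosition;
    }

    // Returns the list position of the fragment containing the given index.
    // The caller must pass an index that lies inside the complete string.
    static int fragmentAt(TreeMap<Integer, Integer> startToPosition, int index) {
        Map.Entry<Integer, Integer> entry = startToPosition.floorEntry(index);
        if (entry == null) {
            throw new IndexOutOfBoundsException("no fragment contains index " + index);
        }
        return entry.getValue();
    }
}
```

The trade-off is a logarithmic lookup instead of a direct one, in exchange for a map whose size depends on the number of fragments and not on the length of the text.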

Interim result:

Now you have enough metadata to delegate every action you want to perform on the complete string to the corresponding Text object. To change the content of a Text object, you only need to call setValue(). That is all docx4j needs in order to edit text, and all style information is preserved.

Last step: search and replace

  1. Build a method that finds all occurrences of your placeholders. You should create a class like FoundResult that stores the start and end indices of a found value (placeholder) in the complete string; in the code below it is constructed from the start index and the matched string.

    public static List<FoundResult> findAllOccurrencesInString(String data, String search) {
        List<FoundResult> list = new ArrayList<>();
        if (search.isEmpty()) {
            // indexOf("") always matches at the current position, which would loop forever
            return list;
        }
        int fromIndex = 0;
        while (true) {
            // continue after the previous match; indices stay relative to data
            int index = data.indexOf(search, fromIndex);
            if (index == -1) {
                break;
            }

            list.add(new FoundResult(index, search));

            fromIndex = index + search.length();
        }
        return list;
    }
    

    Using this, I build a new list of ReplaceCommands. A ReplaceCommand is a simple class that stores a FoundResult and the new value.

  2. Next, order this list from the last item to the first (by position in the complete string).

  3. Now you can write a replace-all algorithm, because you know which action needs to be done on which Text object. Step 2 ensures that a replace operation does not invalidate the indices of the other FoundResults.

    3.1.) Find the Text object(s) that need to be changed.
    3.2.) Call getValue() on them.
    3.3.) Edit the string to the new value.
    3.4.) Call setValue() on the Text objects.
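
FoundResult and ReplaceCommand are not shown in this answer, so the following is a sketch inferred from how they are used here: a FoundResult built from a start index and the matched string with an inclusive end index, and a ReplaceCommand that pairs it with the new value. The helper ReverseOrderReplace is hypothetical and works on a plain string instead of Text objects; it only illustrates why the commands are processed from the last match to the first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of FoundResult: start index plus matched value, end index inclusive.
final class FoundResult {
    private final int start;
    private final int end;

    FoundResult(int start, String value) {
        this.start = start;
        this.end = start + value.length() - 1;
    }

    int getStart() { return start; }
    int getEnd() { return end; }
}

// Sketch of ReplaceCommand: a match plus the value that should replace it.
final class ReplaceCommand {
    private final FoundResult foundResult;
    private final String newValue;

    ReplaceCommand(FoundResult foundResult, String newValue) {
        this.foundResult = foundResult;
        this.newValue = newValue;
    }

    FoundResult getFoundResult() { return foundResult; }
    String getNewValue() { return newValue; }
}

// Hypothetical helper that applies the commands to a plain string.
final class ReverseOrderReplace {
    static String applyAll(String completeString, List<ReplaceCommand> commands) {
        List<ReplaceCommand> sorted = new ArrayList<>(commands);
        // Last match first: an edit never shifts the indices of matches still to be processed.
        sorted.sort(Comparator.comparingInt(
                (ReplaceCommand c) -> c.getFoundResult().getStart()).reversed());
        StringBuilder sb = new StringBuilder(completeString);
        for (ReplaceCommand c : sorted) {
            FoundResult r = c.getFoundResult();
            // StringBuilder.replace takes an exclusive end index, hence the + 1.
            sb.replace(r.getStart(), r.getEnd() + 1, c.getNewValue());
        }
        return sb.toString();
    }
}
```

If the commands were applied front to back instead, the first replacement would change the length of the string and the stored indices of every later match would point at the wrong characters.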

This is the code that does all the magic. It executes a single ReplaceCommand.

    /**
     * @param texts All Text-objects
     * @param replaceCommand Command
     * @param map Lookup-Map from index in complete string to TextMetaItem
     */
    public static void executeReplaceCommand(List<Text> texts, ReplaceCommand replaceCommand, Map<Integer, TextMetaItem> map) {
        TextMetaItem tmi1 = map.get(replaceCommand.getFoundResult().getStart());
        TextMetaItem tmi2 = map.get(replaceCommand.getFoundResult().getEnd());
        if (tmi2.getPosition() - tmi1.getPosition() > 0) {
            // it can happen that text objects are in-between
            // we can remove them (set to null)
            int upperBorder = tmi2.getPosition();
            int lowerBorder = tmi1.getPosition() + 1;
            for (int i = lowerBorder; i < upperBorder; i++) {
                texts.get(i).setValue(null);
            }
        }

        if (tmi1.getPosition() == tmi2.getPosition()) {
            // do replacement inside a single Text-object

            String t1 = tmi1.getText().getValue();
            int beginIndex = tmi1.getPositionInsideTextObject(replaceCommand.getFoundResult().getStart());
            int endIndex = tmi2.getPositionInsideTextObject(replaceCommand.getFoundResult().getEnd());

            String keepBefore = t1.substring(0, beginIndex);
            String keepAfter = t1.substring(endIndex + 1);

            tmi1.getText().setValue(keepBefore + replaceCommand.getNewValue() + keepAfter);
        } else {
            // do replacement across two Text-objects

            // check where to start and replace 
            // the Text-objects value inside both Text-objects
            String t1 = tmi1.getText().getValue();
            String t2 = tmi2.getText().getValue();

            int beginIndex = tmi1.getPositionInsideTextObject(replaceCommand.getFoundResult().getStart());
            int endIndex = tmi2.getPositionInsideTextObject(replaceCommand.getFoundResult().getEnd());

            t1 = t1.substring(0, beginIndex);
            t1 = t1.concat(replaceCommand.getNewValue());
            t2 = t2.substring(endIndex + 1);

            tmi1.getText().setValue(t1);
            tmi2.getText().setValue(t2);
        }
    }
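
To make the cross-object case easier to follow, here is a simplified stand-in that performs the same kind of replacement on a list of plain strings. It is not the library's code: the class name FragmentReplace is hypothetical, the strings stand in for Text values, and fragments that lie completely inside the replaced range are set to an empty string where the method above uses null.

```java
import java.util.List;

// Hypothetical, simplified stand-in for executeReplaceCommand on plain strings.
final class FragmentReplace {
    // Replaces the inclusive range [start, end] of the concatenated fragments with
    // newValue by editing the fragments in place (analogous to Text.setValue).
    static void replaceRange(List<String> fragments, int start, int end, String newValue) {
        int offset = 0;
        boolean inserted = false;
        for (int i = 0; i < fragments.size(); i++) {
            String fragment = fragments.get(i);
            int fragStart = offset;
            int fragEnd = offset + fragment.length() - 1; // inclusive
            offset += fragment.length();
            if (fragEnd < start || fragStart > end) {
                continue; // fragment lies entirely outside the range
            }
            // Keep what precedes the range and what follows it inside this fragment.
            String before = fragment.substring(0, Math.max(0, start - fragStart));
            String after = fragment.substring(Math.min(fragment.length(), end - fragStart + 1));
            // The new value goes into the first affected fragment only.
            fragments.set(i, before + (inserted ? "" : newValue) + after);
            inserted = true;
        }
    }
}
```

Because only the values change and no fragment is removed from the list, the surrounding structure, and with it the styling in the docx4j case, stays where it was.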

Solution 4

This can be a problem. I cover how to mitigate broken-up text runs in this answer here: https://stackoverflow.com/a/17066582/125750

... but you might want to consider content controls instead. The docx4j source site has various content control samples here:

https://github.com/plutext/docx4j/tree/master/src/samples/docx4j/org/docx4j/samples

Author by

luckyi

Updated on June 27, 2022

Comments

  • luckyi almost 2 years

    I have a docx document with some placeholders. I need to replace them with other content and save the result as a new docx document. I started with docx4j and found this method:

    public static List<Object> getAllElementFromObject(Object obj, Class<?> toSearch) {
        List<Object> result = new ArrayList<Object>();
        if (obj instanceof JAXBElement) obj = ((JAXBElement<?>) obj).getValue();
    
        if (obj.getClass().equals(toSearch))
            result.add(obj);
        else if (obj instanceof ContentAccessor) {
            List<?> children = ((ContentAccessor) obj).getContent();
            for (Object child : children) {
                result.addAll(getAllElementFromObject(child, toSearch));
            }
        }
        return result;
    }
    
    public static void findAndReplace(WordprocessingMLPackage doc, String toFind, String replacer){
        List<Object> paragraphs = getAllElementFromObject(doc.getMainDocumentPart(), P.class);
        for(Object par : paragraphs){
            P p = (P) par;
            List<Object> texts = getAllElementFromObject(p, Text.class);
            for(Object text : texts){
                Text t = (Text)text;
                if(t.getValue().contains(toFind)){
                    t.setValue(t.getValue().replace(toFind, replacer));
                }
            }
        }
    }
    

    But that rarely works, because the placeholders are usually split across multiple text runs.

    I also tried UnmarshallFromTemplate, but it rarely works either.

    How can this problem be solved?