Extracting text from a contentEditable div

Solution 1

I forgot about this question until now, when Nico slapped a bounty on it.

I solved the problem by writing the function I needed myself, cribbing a function from the existing jQuery codebase and modifying it to work as I needed.

I've tested this function with Safari (WebKit), IE, Firefox and Opera. I didn't bother checking any other browsers since the whole contentEditable thing is non-standard. It is also possible that an update to any browser could break this function if they change how they implement contentEditable. So programmer beware.

function extractTextWithWhitespace(elems)
{
    var lineBreakNodeName = "BR"; // Use <br> as a default
    if ($.browser.webkit)
    {
        lineBreakNodeName = "DIV";
    }
    else if ($.browser.msie)
    {
        lineBreakNodeName = "P";
    }
    else if ($.browser.mozilla)
    {
        lineBreakNodeName = "BR";
    }
    else if ($.browser.opera)
    {
        lineBreakNodeName = "P";
    }
    var extractedText = extractTextWithWhitespaceWorker(elems, lineBreakNodeName);

    return extractedText;
}

// Cribbed from jQuery 1.4.2 (getText) and modified to retain whitespace
function extractTextWithWhitespaceWorker(elems, lineBreakNodeName)
{
    var ret = "";
    var elem;

    for (var i = 0; elems[i]; i++)
    {
        elem = elems[i];

        if (elem.nodeType === 3     // text node
            || elem.nodeType === 4) // CDATA node
        {
            ret += elem.nodeValue;
        }

        if (elem.nodeName === lineBreakNodeName)
        {
            ret += "\n";
        }

        if (elem.nodeType !== 8) // comment node
        {
            ret += extractTextWithWhitespaceWorker(elem.childNodes, lineBreakNodeName);
        }
    }

    return ret;
}
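
Note that $.browser was removed in jQuery 1.9, so the detection step above no longer runs on current jQuery without the Migrate plugin. A minimal sketch of one way around it, assuming it is acceptable to treat BR, DIV and P alike as line-break markers in every browser, is below. The name extractTextAnyBrowser and the lookup table are hypothetical, not part of jQuery or of the function above, and this has not been checked against the browsers listed earlier.

```javascript
// Hypothetical helper: not part of jQuery or of the original function.
// Assumption: BR, DIV and P are all treated as line-break markers,
// regardless of which browser produced the markup.
var LINE_BREAK_NODE_NAMES = { BR: true, DIV: true, P: true };

function extractTextAnyBrowser(nodes) {
    var ret = "";

    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];

        if (node.nodeType === 3 || node.nodeType === 4) {
            // Text node or CDATA node: keep its content verbatim
            ret += node.nodeValue;
        } else if (node.nodeType === 1) {
            // Element: emit a newline for line-break elements, then descend
            if (LINE_BREAK_NODE_NAMES[node.nodeName] === true) {
                ret += "\n";
            }
            ret += extractTextAnyBrowser(node.childNodes);
        }
        // Comment nodes (nodeType 8) and anything else are skipped
    }

    return ret;
}
```

In a page it could be called as extractTextAnyBrowser(document.getElementById("edit").childNodes), where "edit" is a placeholder ID. Because the newline is emitted when a line-break element opens, markup such as <div><br></div> yields two newlines, and markup that starts with a <P> yields a leading newline, so the result may need post-processing.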

Solution 2

Unfortunately, you still have to handle the pre case individually per browser. (I don't condone browser detection in many cases; use feature detection. In this case, though, it's necessary.) Luckily, you can take care of them all pretty concisely, like this:

var ce = $("<pre />").html($("#edit").html());
if($.browser.webkit) 
  ce.find("div").replaceWith(function() { return "\n" + this.innerHTML; });    
if($.browser.msie) 
  ce.find("p").replaceWith(function() { return this.innerHTML  +  "<br>"; });
if($.browser.mozilla || $.browser.opera ||$.browser.msie )
  ce.find("br").replaceWith("\n");

var textWithWhiteSpaceIntact = ce.text();

You can test it out here. IE in particular is a hassle because of the way it handles &nbsp; and newlines in text conversion. That's why it gets the <br> treatment above to make it consistent, so it needs two passes to be handled correctly.

In the above, #edit is the ID of the contentEditable component, so just change that, or make this a function, for example:

function getContentEditableText(id) {
    var ce = $("<pre />").html($("#" + id).html());
    if ($.browser.webkit)
      ce.find("div").replaceWith(function() { return "\n" + this.innerHTML; });
    if ($.browser.msie)
      ce.find("p").replaceWith(function() { return this.innerHTML + "<br>"; });
    if ($.browser.mozilla || $.browser.opera || $.browser.msie)
      ce.find("br").replaceWith("\n");

    return ce.text();
}

You can test that here. Or, since this is built on jQuery methods anyway, make it a plugin, like this:

$.fn.getPreText = function () {
    var ce = $("<pre />").html(this.html());
    if ($.browser.webkit)
      ce.find("div").replaceWith(function() { return "\n" + this.innerHTML; });
    if ($.browser.msie)
      ce.find("p").replaceWith(function() { return this.innerHTML + "<br>"; });
    if ($.browser.mozilla || $.browser.opera || $.browser.msie)
      ce.find("br").replaceWith("\n");

    return ce.text();
};

Then you can just call it with $("#edit").getPreText(). You can test that version here.

Solution 3

I discovered this today in Firefox:

I pass a contenteditable div whose white-space is set to "pre" to this function, and it works well.

I added a line to show how many nodes there are, and a button that puts the output into another PRE, just to prove that the linebreaks are intact.

It basically says this:

For each child node of the DIV,
   if it contains the 'data' property,
      add the data value to the output
   otherwise
      add an LF (or a CRLF for Windows)
and return the result.

There is an issue, tho. When you hit enter at the end of any line of the original text, instead of putting a LF in, it puts a "Â" in. You can hit enter again and it puts a LF in there, but not the first time. And you have to delete the "Â" (it looks like a space). Go figure - I guess that's a bug.

This doesn't occur in IE8 (change textContent to innerText). There is a different bug there, tho. When you hit enter, it splits the node into 2 nodes, as it does in Firefox, but the "data" property of each of those nodes then becomes "undefined".

I'm sure there's much more going on here than meets the eye, so any input on the matter will be enlightening.

<!DOCTYPE html>
<html>
<HEAD>
<SCRIPT type="text/javascript">
    function htmlToText(elem) {
        var outText="";
        for(var x=0; x<elem.childNodes.length; x++){
            if(elem.childNodes[x].data){
                outText+=elem.childNodes[x].data;
            }else{
                outText+="\n";
            }
        }
        alert(elem.childNodes.length + " Nodes: \r\n\r\n" + outText);
        return(outText);
    }
</SCRIPT>
</HEAD>
<body>

<div style="white-space:pre;" contenteditable=true id=test>Text in a pre element
is displayed in a fixed-width
font, and it preserves
both      spaces and
line breaks
</DIV>
<INPUT type=button value="submit" onclick="document.getElementById('test2').textContent=htmlToText(document.getElementById('test'))">
<PRE id=test2>
</PRE>
</body>
</html>

Solution 4

See this fiddle, or this post:

How to parse editable DIV's text with browser compatibility

Created after a lot of effort.

Author: Shaggy Frog

Updated on June 15, 2020

Comments

  • Shaggy Frog
    Shaggy Frog almost 4 years

    I have a div set to contentEditable and styled with "white-space:pre" so it keeps things like linebreaks. In Safari, FF and IE, the div pretty much looks and works the same. All is well. What I want to do is extract the text from this div, but in such a way that will not lose the formatting -- specifically, the line breaks.

    We are using jQuery, whose text() function basically does a pre-order DFS and glues together all the content in that branch of the DOM into a single lump. This loses the formatting.
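
    To illustrate the problem, here is a simplified stand-in for that traversal, run against hand-built objects that mimic the DOM nodes for 1<div>2</div><div>3</div>. The helper names plainText, text and element are made up for this sketch, and the fake nodes carry only the properties the walk reads.

```javascript
// Simplified stand-in for a getText-style traversal: a pre-order walk that
// concatenates text node values and ignores element boundaries.
function plainText(nodes) {
    var ret = "";

    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];

        if (node.nodeType === 3 || node.nodeType === 4) {
            // Text node or CDATA node
            ret += node.nodeValue;
        } else if (node.nodeType !== 8) {
            // Anything else except comment nodes: descend into children
            ret += plainText(node.childNodes);
        }
    }

    return ret;
}

// Hypothetical builders for node-like objects (not real DOM nodes).
function text(value) {
    return { nodeType: 3, nodeValue: value, childNodes: [] };
}
function element(name, children) {
    return { nodeType: 1, nodeName: name, childNodes: children };
}

// Mimics the node tree for: 1<div>2</div><div>3</div>
var safariNodes = [
    text("1"),
    element("DIV", [text("2")]),
    element("DIV", [text("3")])
];

var flattened = plainText(safariNodes); // "123": the line structure is gone
```

    Nothing in the walk records where a DIV begins or ends, so the three lines collapse into one string.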

    I had a look at the html() function, but it seems that all three browsers do different things with the actual HTML that gets generated behind the scenes in my contentEditable div. Assuming I type this into my div:

    1
    2
    3
    

    These are the results:

    Safari 4:

    1
    <div>2</div>
    <div>3</div>
    

    Firefox 3.6:

    1
    <br _moz_dirty="">
    2
    <br _moz_dirty="">
    3
    <br _moz_dirty="">
    <br _moz_dirty="" type="_moz">
    

    IE 8:

    <P>1</P><P>2</P><P>3</P>
    

    Ugh. Nothing very consistent here. The surprising thing is that MSIE looks the most sane! (Capitalized P tag and all)

    The div will have dynamically set styling (font face, colour, size and alignment) which is done using CSS, so I'm not sure if I can use a pre tag (which was alluded to on some pages I found using Google).

    Does anyone know of any JavaScript code and/or jQuery plugin or something that will extract text from a contentEditable div in such a way as to preserve linebreaks? I'd prefer not to reinvent a parsing wheel if I don't have to.

    Update: I cribbed the getText function from jQuery 1.4.2 and modified it to extract the text with whitespace mostly intact (I only changed one line, where I add a newline):

    function extractTextWithWhitespace( elems ) {
        var ret = "", elem;
    
        for ( var i = 0; elems[i]; i++ ) {
            elem = elems[i];
    
            // Get the text from text nodes and CDATA nodes
            if ( elem.nodeType === 3 || elem.nodeType === 4 ) {
                ret += elem.nodeValue + "\n";
    
            // Traverse everything else, except comment nodes
            } else if ( elem.nodeType !== 8 ) {
                ret += extractTextWithWhitespace( elem.childNodes );
            }
        }
    
        return ret;
    }
    

    I call this function and use its output to assign it to an XML node with jQuery, something like:

    var extractedText = extractTextWithWhitespace($(this));
    var $someXmlNode = $('<someXmlNode/>');
    $someXmlNode.text(extractedText);
    

    The resulting XML is eventually sent to a server via an AJAX call.

    This works well in Safari and Firefox.

    On IE, only the first '\n' seems to get retained somehow. Looking into it more, it looks like jQuery is setting the text like so (line 4004 of jQuery-1.4.2.js):

    return this.empty().append( (this[0] && this[0].ownerDocument || document).createTextNode( text ) );
    

    Reading up on createTextNode, it appears that IE's implementation may mash up the whitespace. Is this true or am I doing something wrong?

    • Yahel
      Yahel over 13 years
      Interestingly, not surprising that IE is acting the most sane: contentEditable was originally IE proprietary; it's been in IE since 5.5, so I guess they've had the most time to get it working well.
  • Tim Down
    Tim Down over 13 years
    Ick. As you observe, browser detection is bad. Fortunately, it is avoidable here: see my answer.
  • Nick Craver
    Nick Craver over 13 years
    @Tim - I couldn't get your approach to work in IE or Opera though: jsfiddle.net/UjZEN/3
  • John Robert Allan
    John Robert Allan over 10 years
    this (above fiddle) breaks even in chrome... 1) add 1,2,3,4 on separate lines 2) test, looks ok 3) go to the beginning of line 2, press backspace 4) press enter 5) test - notice lines 2,3,4 are now all on one line
  • John Robert Allan
    John Robert Allan over 10 years
    this also breaks in Chrome - 1) enter 1,2,3,4 on separate lines 2) go back to line 1 3) type a few words 4) go to beginning of line two, press backspace, press enter, press backspace 5) view results, line 2 will have an extra line break after it
  • Oli
    Oli over 9 years
    Works well for me (in FF and Chrome). Haven't evaluated it computationally against the other $.browser options but given Jquery doesn't ship that plugin any more, this was easier to drop in. I'll worry about performance another day :)
  • Amicable
    Amicable almost 9 years
    I use a contenteditable div for the benefits of rendering HTML within it e.g. text highlighting excess characters like twitter. I'm not interested in saving that formatting to my database though.
  • Jon z
    Jon z over 8 years
    @Amicable Did you try the function? Let me know if it seems to work for you. Also be aware that typically w/ a contenteditable element when you copy/paste HTML the formatting is retained - you probably want to do as Twitter does and filter out the markup in this situation.
  • Lukus
    Lukus about 8 years
    Nice clean solution, however, it doesn't work for cases where browser is inconsistent with layers. I.e., chrome does not include a div as the first element when typing but does as soon as you hit enter. I found this solution didn't quite handle that case.
  • Ortund
    Ortund almost 7 years
    Hi! Thanks for your answer and welcome to Stackoverflow. Please have a look at how to answer and try to improve your answer a little bit. Adding an explanation as to how the OP was going wrong or what your code does better helps to improve the quality of your answer.