How do I convert RTF to plain text?

Solution 1

Take a look at this example; the code is reproduced here for preservation.

UPDATED -- copy and paste error from a VB.NET program -- sorry folks.

class ConvertFromRTF
{
    static void Main()
    {

        string path = @"test.rtf";

        //Create the RichTextBox. (Requires a reference to System.Windows.Forms.dll.)
        using (System.Windows.Forms.RichTextBox rtBox = new System.Windows.Forms.RichTextBox())
        {

            // Get the contents of the RTF file. Note that when it is
            // stored in the string, it is encoded as UTF-16.
            string s = System.IO.File.ReadAllText(path);

            // Convert the RTF to plain text.
            rtBox.Rtf = s;
            string plainText = rtBox.Text;

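            // Replace each line break with a comma so the result fits on one line.
            // RichTextBox.Text may use "\n" alone, so handle "\r\n" first and then "\n".
            plainText = plainText.Replace("\r\n", ",").Replace("\n", ",");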

            // Output plain text to file, encoded as UTF-8.
            System.IO.File.WriteAllText(@"output.txt", plainText);
        }
    }
}

Solution 2

How to: Convert RTF to Plain Text (C# Programming Guide)

In the .NET Framework, you can use the RichTextBox control to create a word processor that supports RTF and enables a user to apply formatting to text in a WYSIWYG manner.

You can also use the RichTextBox control to programmatically remove the RTF formatting codes from a document and convert it to plain text. You do not need to embed the control in a Windows Form to perform this kind of operation.
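As a minimal sketch of that idea, the conversion can be wrapped in a helper that takes an RTF string and returns its text. The names `RtfConverter` and `ToPlainText` are illustrative and not part of any library; only `RichTextBox` and its `Rtf` and `Text` properties come from System.Windows.Forms.

```csharp
using System.Windows.Forms; // requires a reference to System.Windows.Forms.dll

static class RtfConverter
{
    // Hypothetical helper: converts one RTF document, held in a string, to plain text.
    public static string ToPlainText(string rtf)
    {
        // The control is never added to a form; it is used only to parse the RTF.
        // The using block disposes the control once the text has been read.
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;
            return box.Text;
        }
    }
}
```

The returned text can still contain line breaks, so any single-line cleanup has to happen on this result rather than on the RTF input.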

Author: Jason94

Updated on September 15, 2022

Comments

  • Jason94 about 1 year

    I've been given a rather large Excel file in which each row contains one CLOB dump from our Oracle database. One of them might look like this:

    {\rtf1\ansi\deff0\deftab708{\fonttbl{\f0\fnil\fcharset0 Courier New;}{\f1\fnil\fcharset0 Arial;}{\f2\fnil\fcharset0 MS Sans Serif;}{\f3\fnil\fcharset0 Times New Roman;}{\f4\fnil\fcharset238 Times New Roman CE;}{\f5\fnil\fcharset204 Times New Roman Cyr;}{\f6\fnil\fcharset161 Times New Roman Greek;}{\f7\fnil\fcharset162 Times New Roman Tur;}{\f8\fnil\fcharset186 Times New Roman Baltic;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red255\green255\blue0;\red255\green255\blue255;\red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green128\blue0;\red128\green0\blue0;\red128\green128\blue128;\red255\green255\blue255;}\paperw11906\paperh16838\margl1417\margr1417\margt1417\margb1417{\*\pnseclvl1\pnucrm\pnstart1\pnhang\pnindent720{\pntxtb}{\pntxta{.}}}{\*\pnseclvl2\pnucltr\pnstart1\pnhang\pnindent720{\pntxtb}{\pntxta{.}}}{\*\pnseclvl3\pndec\pnstart1\pnhang\pnindent720{\pntxtb}{\pntxta{.}}}{\*\pnseclvl4\pnlcltr\pnstart1\pnhang\pnindent720{\pntxtb}{\pntxta{)}}}{\*\pnseclvl5\pndec\pnstart1\pnhang\pnindent720{\pntxtb{(}}{\pntxta{)}}}{\*\pnseclvl6\pnlcltr\pnstart1\pnhang\pnindent720{\pntxtb{(}}{\pntxta{)}}}{\*\pnseclvl7\pnlcrm\pnstart1\pnhang\pnindent720{\pntxtb{(}}{\pntxta{)}}}{\*\pnseclvl8\pnlcltr\pnstart1\pnhang\pnindent720{\pntxtb{(}}{\pntxta{)}}}{\*\pnseclvl9\pnlcrm\pnstart1\pnhang\pnindent720{\pntxtb{(}}{\pntxta{)}}}{\pard\ql\li0\fi0\ri0\sb0\sl\sa0 \plain\f3\fs24\cf0 FOO FOO FOO \'85\'85. \'85\'85..}}
    

    Now, by putting this data in a System.Windows.Forms.RichTextBox's .Rtf property and then reading out its .Text value, I get a simple conversion. But somehow it brings along its newlines.

    I've tried removing them by

    rtf.Replace("\n", "").Replace("\r", "").Replace(Environment.NewLine, "")

    But it does not seem to help.

    Does anyone know how I can convert the rich text format to single-line plain text?

    • Matt Burland about 11 years
      Are you trying to do the replacement on the original RTF or on the plain string from RichTextBox.Text?
  • Mike Perrenoud about 11 years
    This is close, but doesn't fully accomplish the OP's needs; please see my answer.
  • L.B about 11 years
    Where is ControlChars defined? Also, the OP says he/she already tried to replace \n and \r.
  • Mike Perrenoud about 11 years
    @L.B, you need to replace them together as a grouping -- or at least that's what I found.
  • Mike Perrenoud about 11 years
    @L.B, it was a copy and paste error from a VB.NET program and I mentioned that now in my answer. Sorry for any confusion or frustration.
  • Jason94 about 11 years
    @Mike I'm still having problems, but rtf = rtf.Replace("\n", "").Replace("\r", "").Replace(Environment.NewLine, "").Replace("\\par", ""); seems to almost solve the problem. What is \par?
  • wal over 10 years
    You really should dispose of the RichTextBox.
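
On the \par question: in RTF, \par is the control word that ends a paragraph, so stripping it from the raw RTF string also damages other control words that begin with the same letters, such as \pard. A less fragile route is to collapse line breaks in the plain text instead. The sketch below assumes the RTF has already been converted through RichTextBox.Text. `SingleLine.Collapse` is a hypothetical helper name, and the default separator of a single space is an assumption.

```csharp
using System.Text.RegularExpressions;

static class SingleLine
{
    // Hypothetical helper: joins all lines of already-converted plain text into one line.
    public static string Collapse(string plainText, string separator = " ")
    {
        // Trim first so leading or trailing line breaks do not become separators.
        string trimmed = plainText.Trim();

        // Replace every run of CR/LF characters, plus the whitespace around it,
        // with a single separator. This covers "\r\n", "\n" and "\r" alike.
        return Regex.Replace(trimmed, @"\s*[\r\n]+\s*", separator);
    }
}
```

For example, `SingleLine.Collapse(rtBox.Text)` would be applied to the control's text output, never to the string assigned to `Rtf`.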