MS Word special characters Regex

That pesky Microsoft! They have to be different and mess us developers around, don’t they? Have you ever noticed that Microsoft Word’s symbols look a bit different or act a little odd? That’s because they are not plain ASCII characters, which can be a pain for Regex and anything else that expects simple quotes and dashes. So how do you match them?

The reason they are so difficult is that Word inserts punctuation from the Windows-1252 character set, and these characters have no equivalent in ASCII or ISO-8859-1. Just about everything else sticks to the plain ASCII versions, of course. These characters include:

  • … ellipsis
  • ‘smart’ “quotes”
  • en – dash and em — dash
  • dagger † and double dagger ‡

There are a few more of course, but these are the ones that come up most often. You can find more of the Microsoft Word Windows-1252 character encoding here.
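If a document contains a character that is not in the list, a small helper can report it. The following is only a sketch: the name `listNonAscii` is made up for illustration, and it assumes the characters of interest sit in the Basic Multilingual Plane (true for everything in Windows-1252), so `charCodeAt` is enough.

```javascript
// Hypothetical helper: list each distinct non-ASCII character in a string
// together with its \uXXXX escape, ready to paste into a regex.
function listNonAscii(text) {
  var seen = {};
  var result = [];

  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);

    // Anything above 0x7F is outside plain ASCII.
    if (code > 0x7F && !seen[code]) {
      seen[code] = true;

      // Pad the hex value to four digits, e.g. A0 -> 00A0.
      var hex = code.toString(16).toUpperCase();
      while (hex.length < 4) {
        hex = "0" + hex;
      }

      result.push({ char: text.charAt(i), escape: "\\u" + hex });
    }
  }

  return result;
}

console.log(listNonAscii("Wait\u2026 \u201Creally\u201D?"));
```

Each entry in the returned array pairs the character with its escape, so an unexpected symbol can be added to the patterns used in the rest of this post.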

 

Symbol                          Encoding
single quotes and apostrophe    \u2018\u2019\u201A
double quotes                   \u201C\u201D\u201E
ellipsis                        \u2026
dashes (en and em)              \u2013\u2014
circumflex                      \u02C6
open angle bracket              \u2039
close angle bracket             \u203A
small tilde                     \u02DC
non-breaking space              \u00A0

 

Here is a pre-built method for JavaScript and C# to combat these.

 

JavaScript Clean String

var wordClean = function (text) {
  var cleanStr = text;

  // smart single quotes and apostrophe
  cleanStr = cleanStr.replace(/[\u2018\u2019\u201A]/g, "'");

  // smart double quotes
  cleanStr = cleanStr.replace(/[\u201C\u201D\u201E]/g, "\"");

  // ellipsis becomes three plain dots
  cleanStr = cleanStr.replace(/\u2026/g, "...");

  // en and em dashes
  cleanStr = cleanStr.replace(/[\u2013\u2014]/g, "-");

  // circumflex
  cleanStr = cleanStr.replace(/\u02C6/g, "^");

  // open angle bracket
  cleanStr = cleanStr.replace(/\u2039/g, "<");

  // close angle bracket
  cleanStr = cleanStr.replace(/\u203A/g, ">");

  // small tilde (U+02DC is a tilde, not a space)
  cleanStr = cleanStr.replace(/\u02DC/g, "~");

  // non-breaking space
  cleanStr = cleanStr.replace(/\u00A0/g, " ");

  return cleanStr;
};
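The same clean-up can also be done in a single pass with a lookup table instead of one `replace` call per group. This is a sketch rather than a drop-in replacement: `WORD_CHAR_MAP`, `WORD_CHAR_PATTERN` and `wordCleanSinglePass` are names invented here, and the chosen replacements (for example `~` for the small tilde U+02DC and `...` for the ellipsis) are assumptions you may want to change.

```javascript
// Hypothetical lookup table: Word character -> plain ASCII replacement.
var WORD_CHAR_MAP = {
  "\u2018": "'", "\u2019": "'", "\u201A": "'",    // smart single quotes
  "\u201C": "\"", "\u201D": "\"", "\u201E": "\"", // smart double quotes
  "\u2026": "...",                                // ellipsis
  "\u2013": "-", "\u2014": "-",                   // en and em dash
  "\u02C6": "^",                                  // circumflex
  "\u2039": "<", "\u203A": ">",                   // angle brackets
  "\u02DC": "~",                                  // small tilde
  "\u00A0": " "                                   // non-breaking space
};

// Character class covering every key in the table above.
var WORD_CHAR_PATTERN =
  /[\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]/g;

function wordCleanSinglePass(text) {
  // The callback receives each matched character and looks up its replacement.
  return text.replace(WORD_CHAR_PATTERN, function (ch) {
    return WORD_CHAR_MAP[ch];
  });
}

console.log(wordCleanSinglePass("\u201CIt\u2019s done\u201D \u2014 almost\u2026"));
// prints: "It's done" - almost...
```

Keeping the replacements in one table makes it easier to add a character later, as long as the pattern is updated to match.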


C# Clean String

// requires: using System.Text.RegularExpressions;
public string wordClean(string text)
{
    var cleanStr = text;

    // smart single quotes and apostrophe
    cleanStr = Regex.Replace(cleanStr, "[\u2018\u2019\u201A]", "'");

    // smart double quotes
    cleanStr = Regex.Replace(cleanStr, "[\u201C\u201D\u201E]", "\"");

    // ellipsis becomes three plain dots
    cleanStr = Regex.Replace(cleanStr, "\u2026", "...");

    // en and em dashes
    cleanStr = Regex.Replace(cleanStr, "[\u2013\u2014]", "-");

    // circumflex
    cleanStr = Regex.Replace(cleanStr, "\u02C6", "^");

    // open angle bracket
    cleanStr = Regex.Replace(cleanStr, "\u2039", "<");

    // close angle bracket
    cleanStr = Regex.Replace(cleanStr, "\u203A", ">");

    // small tilde (U+02DC is a tilde, not a space)
    cleanStr = Regex.Replace(cleanStr, "\u02DC", "~");

    // non-breaking space
    cleanStr = Regex.Replace(cleanStr, "\u00A0", " ");

    return cleanStr;
}


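If you are doing some validation using Regex, here is how you can check for these characters as well. The functions return a comma-separated list of the kinds of Word characters found in the text.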

JavaScript Regex

function containsWordChar(text) {
  // collect the name of every kind of Word character found
  var contains = [];

  if (/[\u2018\u2019\u201A]/.test(text)) {
    contains.push("single quotes and apostrophe");
  }
  if (/[\u201C\u201D\u201E]/.test(text)) {
    contains.push("double quotes");
  }
  if (/\u2026/.test(text)) {
    contains.push("ellipsis");
  }
  if (/[\u2013\u2014]/.test(text)) {
    contains.push("dashes");
  }
  if (/\u02C6/.test(text)) {
    contains.push("circumflex");
  }
  if (/\u2039/.test(text)) {
    contains.push("open angle bracket");
  }
  if (/\u203A/.test(text)) {
    contains.push("close angle bracket");
  }
  if (/[\u02DC\u00A0]/.test(text)) {
    contains.push("small tilde or non-breaking space");
  }

  // empty string when nothing was found
  return contains.join(", ");
}


C# Regex (MVC)

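// The attribute passes only when the whole value matches, so the class is negated:
// the value is valid when it contains none of the Word characters.
[RegularExpression("^[^\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]*$", ErrorMessage = "Your content contains Microsoft Word (Windows-1252) special characters.")]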
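Because a `RegularExpression` attribute accepts a value only when the whole value matches the pattern, a rule that should reject Word characters has to describe text that is free of them. The same idea can be tried out in JavaScript. This is a minimal sketch, and the names `NO_WORD_CHARS` and `isFreeOfWordChars` are made up for illustration.

```javascript
// Negated character class anchored at both ends: the value matches only
// when every character is something other than a Word special character.
var NO_WORD_CHARS =
  /^[^\u2018\u2019\u201A\u201C\u201D\u201E\u2026\u2013\u2014\u02C6\u2039\u203A\u02DC\u00A0]*$/;

// Hypothetical validator: true means the text is safe to accept.
function isFreeOfWordChars(text) {
  return NO_WORD_CHARS.test(text);
}

console.log(isFreeOfWordChars("plain \"quotes\" - fine"));      // true
console.log(isFreeOfWordChars("smart \u201Cquotes\u201D here")); // false
```

Using `*` rather than `+` lets an empty value pass, which leaves the decision about required fields to a separate rule.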


C# Regex

// requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public string containsWordChar(string text)
{
    // collect the name of every kind of Word character found
    var contains = new List<string>();

    if (Regex.IsMatch(text, @"[\u2018\u2019\u201A]"))
        contains.Add("single quotes and apostrophe");

    if (Regex.IsMatch(text, @"[\u201C\u201D\u201E]"))
        contains.Add("double quotes");

    if (Regex.IsMatch(text, @"\u2026"))
        contains.Add("ellipsis");

    if (Regex.IsMatch(text, @"[\u2013\u2014]"))
        contains.Add("dashes");

    if (Regex.IsMatch(text, @"\u02C6"))
        contains.Add("circumflex");

    if (Regex.IsMatch(text, @"\u2039"))
        contains.Add("open angle bracket");

    if (Regex.IsMatch(text, @"\u203A"))
        contains.Add("close angle bracket");

    if (Regex.IsMatch(text, @"[\u02DC\u00A0]"))
        contains.Add("small tilde or non-breaking space");

    // empty string when nothing was found
    return string.Join(", ", contains);
}

 
