Replace unicode characters from a Java String

Applications that read input files from external sources often run into special characters, or into files that were written with a different encoding than the one used to read them. Such text usually has to be cleaned up before further processing. In this post we will see how to replace Unicode escape sequences (such as \u2122) in a Java String with the characters they represent.

Before looking at the actual Java code for replacing Unicode escape sequences, let's see what Unicode actually means.
As per the unicode.org definition:

“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

For example, have a look at this text, which contains a Unicode escape sequence:

TN\u2122 official trademark for Tech Notes

After decoding, the text reads:

TN™ official trademark for Tech Notes

Here \u2122 is the Unicode escape sequence for the ™ character. Note that ™ itself is not an ASCII character; it is the escaped form that uses only ASCII characters.
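In Java source code the two forms of this text are easy to confuse, because the compiler itself decodes Unicode escapes inside string literals. The following is a minimal sketch of the difference; the class and constant names are made up for illustration.

```java
class EscapeVersusCharacter {

    // The compiler decodes this escape, so the string holds the single character U+2122.
    static final String DECODED = "TN\u2122";

    // Doubling the backslash keeps the escape as six plain characters,
    // which is how it typically arrives from an external file.
    static final String ESCAPED = "TN\\u2122";

    public static void main(String[] args) {
        System.out.println(DECODED.length());        // 3
        System.out.println(ESCAPED.length());        // 8
        System.out.println(DECODED.equals(ESCAPED)); // false
    }
}
```

The rest of this post deals with the second form: text in which the backslash, the letter u and the hex digits are all still present as separate characters.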

Below is the code that replaces every Unicode escape sequence in a string with the character it represents.

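import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static StringBuffer removeUTFCharacters(String data) {
    // Matches a literal backslash, the letter u and exactly four hex digits (captured as group 1).
    Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
    Matcher m = p.matcher(data);
    StringBuffer buf = new StringBuffer(data.length());
    while (m.find()) {
        // Parse the captured hex digits as a base-16 number and cast the result to a char.
        String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
        // quoteReplacement keeps a decoded '\' or '$' from being read as a replacement directive.
        m.appendReplacement(buf, Matcher.quoteReplacement(ch));
    }
    // Copy whatever follows the last match.
    m.appendTail(buf);
    return buf;
}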

The code above uses a regular expression to match sequences of the form \uXXXX. A Pattern object is created for this and then applied to the data in which we want to find and replace the escape sequences.

In a regular expression, \ is the escape character, and the Java regex syntax also treats the sequence \u specially (to represent a Unicode escape). To match a literal backslash, the regex therefore has to contain \\. Java string literals use \ to introduce escapes as well, so each of those two backslashes has to be doubled again in the source code. So, in the pattern, “\\\\u” really means, “match \u in the input.”

To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, “(\\p{XDigit}{4})” in the pattern means, “match 4 hexadecimal characters in the input, and capture them.”
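To see the pattern at work on its own, here is a small sketch that only runs the match and prints the captured group. The class and helper names are hypothetical; the pattern string is the one used in the method above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapePatternDemo {

    // A literal backslash, the letter u, then four hex digits captured as group 1.
    static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Returns the hex digits captured by the first match, or null when nothing matches.
    static String firstHexGroup(String input) {
        Matcher m = ESCAPE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHexGroup("TN\\u2122 official trademark")); // 2122
        System.out.println(firstHexGroup("no escapes here"));              // null
        System.out.println(firstHexGroup("\\u12"));                        // null, too few hex digits
    }
}
```

Because the quantifier is exactly {4}, an escape followed by further hex digits still captures only the first four.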

In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, “parse the group captured in the previous match as a base-16 number.” The result is cast to a char, and a replacement string is created with that character. Matcher.quoteReplacement makes sure that a decoded \ or $ is inserted literally instead of being interpreted by appendReplacement.
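The decoding step can be tried in isolation. The sketch below uses a hypothetical helper named decode; note the radix argument 16, without which Integer.parseInt would read the digits as decimal and fail on input such as "00e9".

```java
class HexToChar {

    // Parses four hex digits and returns the matching UTF-16 code unit as a one-char String.
    static String decode(String hex) {
        return String.valueOf((char) Integer.parseInt(hex, 16));
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("2122", 16));    // 8482
        System.out.println(decode("0041"));                  // A
        System.out.println(decode("2122").equals("\u2122")); // true

        // Two consecutive escapes can form a surrogate pair, i.e. one code point above U+FFFF.
        String pair = decode("D83D") + decode("DE00");
        System.out.println(pair.codePointCount(0, pair.length())); // 1
    }
}
```

Since each escape is decoded to one char, characters outside the Basic Multilingual Plane only come out right when the input encodes them as two escapes, one per surrogate.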

When you call the above method with an input string containing Unicode escape sequences, it decodes them and replaces them with the actual characters.
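As a rough usage example, the method can be wrapped in a small class and run against a few inputs. The class name and the sample strings other than the Tech Notes one are assumptions made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDemo {

    // The method from this post, unchanged apart from formatting.
    public static StringBuffer removeUTFCharacters(String data) {
        Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
        Matcher m = p.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf;
    }

    public static void main(String[] args) {
        // Doubled backslashes keep the escapes as plain text, as they would arrive from a file.
        String[] inputs = {
            "TN\\u2122 official trademark for Tech Notes",
            "\\u0048\\u0069 there",
            "price \\u002450",
            "nothing to decode"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + removeUTFCharacters(input));
        }
    }
}
```

The third input decodes to a dollar sign, which is the case that Matcher.quoteReplacement protects against. Input without any escape sequence is returned unchanged.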

Another important thing to keep in mind is the encoding that was used to write the file: you have to read the file with the same encoding, otherwise characters will be decoded incorrectly.
The encoding can be set on the InputStreamReader like this:

FileInputStream inStream = new FileInputStream(inputFileLocation);
// UTF-8 encoding was used when the file was written, so read it with the same encoding.
InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8");
BufferedReader bufferedReader = new BufferedReader(streamReader);
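The effect of a mismatched encoding can be demonstrated with a temporary file. This is only a sketch: it swaps in Files.newBufferedReader and the StandardCharsets constants from the standard library in place of the InputStreamReader shown above, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetRoundTrip {

    // Writes text to a temporary file with one charset and reads the first line back with another.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, text.getBytes(writeCharset));
                // try-with-resources closes the reader even if readLine throws.
                try (BufferedReader reader = Files.newBufferedReader(file, readCharset)) {
                    return reader.readLine();
                }
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "TN\u2122";
        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(same.equals(text));       // true
        System.out.println(mismatched.equals(text)); // false
        System.out.println(mismatched.length());     // 5: the three UTF-8 bytes became three chars
    }
}
```

Reading with the wrong charset does not necessarily throw an exception; as in the ISO-8859-1 case here, it can silently produce different characters.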

Feel free to drop a comment if you run into any issue or problem with the above code.

PS: I have not tested the above code with every Unicode character, but I have tested it with most of them.

Saurabh Jain

A developer working on enterprise applications, distributed systems, Hadoop and Big Data. This blog is about my experience working mostly on Java technologies, NoSQL, git, Maven and the Hadoop ecosystem.