Remove Unicode characters while querying in Hive

12,475

You may use

select REGEXP_REPLACE(YOUR_STRING_HERE, '\\P{ASCII}.*', '')

It will remove all the string up to its end from the first found non-ASCII char.

Hive regex supports Unicode property classes, and \p{ASCII} matches any ASCII char. The opposite Unicode property classes are formed by turning p to upper case. So, \P{ASCII} matches any char that is not ASCII. .* matches any 0+ chars as many as possible, as * is a greedy quantifier.

Note that . does not match line breaks by default. If you need to remove line breaks, add (?s) at the start of the pattern:

'(?s)\\P{ASCII}.*'
Share:
12,475
haimen
Author by

haimen

Updated on June 13, 2022

Comments

  • haimen
    haimen almost 2 years

    I want to clean the unicode the data from the Hive table. The following is the data,

    select ('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��')
    

    My required output is to find if there are any unicode characters in my column and to remove it. The output here should be,

    http://10.0.0.1/
    

    or completely null. Either of them is fine. If a row contains any unicode character, it is fine to make it null completely.

    The following are my tryings,

     select REGEXP_REPLACE('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��', '\\[[:xdigit:]]{4}', '')
    

    and

     select REGEXP_REPLACE('http://10.0.0.1/���m��v������)�a�^�����kn:4�+9x�2c��m�{��', '[||chr(128)||'-'||chr(255)||]', '')
    
    Executed as Single statement.  Failed [40000 : 42000] Error while compiling statement: FAILED: ParseException line 1:193 mismatched input '<EOF>' expecting ) near ')' in function specification 
    Elapsed time = 00:00:00.220 
    
    STATEMENT 1: SELECT Statement failed. 
    

    Can somebody help me in cleaning these in my table ?

    Thanks

    Edit:

    Places where is is working,

    select REGEXP_REPLACE('"http://r.rxthdr.com/w?i=s�F�""�HY|�K�>�0����D����W8뤒�O0�Q�D�1��Vc~�j[Q��f��{u�Be�S>n���Ò���&��F9���C�i��8:ڔ�_@ĪO��K?�Ēc�6��=��v[�����D�$%��:�a�40ݩ�&O��K��""�0�a<x��TcX���b��TN�}�x�o��UY$K�I�Օ""��(+�M���E�=K�A�I�A���q#l�(�yt�5��h}��~[��YOA��G�=ïˆï¿½{���. �Q���ح;x=�s�0:�', '(?s).*\\P{ASCII}.*', '')
    

    Places where it is not working,

     select REGEXP_REPLACE('c4k0j,}W""d+2|4y0hkCkRh+.{pq80{?X8O>b<:ph.3!{T', '(?s).*\\P{ASCII}.*', '')
    
     select REGEXP_REPLACE('z|""},}69]6N2|c_;5.su={IU+|8ubq1<r$!Xxy#?Bhkv20:jXNgRh+5fwj:ndfWBJ}e)>','(?s).*\\P{ASCII}.*', '')
    

    The first one in the image has a unicode character. But while pasting it becomes a dot.

    enter image description here

    Can you please help me in doing this?

  • haimen
    haimen almost 7 years
    select REGEXP_REPLACE('z|""},}69]6N2|c_;5.su={IU+|8ubq1<r$!Xxy#?Bhk‌​v20:jXNgRh+5fwj:ndfW‌​BJ}e)>', '(?s)\\P{ASCII}.*', '') he following doesn't work.. Can you help me in solving this?
  • haimen
    haimen almost 7 years
    There are few unicode values which are not shown up here. Can we do something like if there are any unicode, make the entire row null?
  • Wiktor Stribiżew
    Wiktor Stribiżew almost 7 years
    Yes, to remove the whole row add .* before \P{ASCII}: '(?s).*\\P{ASCII}.*'
  • Wiktor Stribiżew
    Wiktor Stribiżew almost 7 years
    @haimen Your update does not help much, as the pasted texts seem to have no non-ASCII chars. Please paste the 2-3 char chunk with those strange chars into a Unicode converter green field and let me know the hex codes.