Data360 Analyze

  • 1.  Searching with unicode regular expression

    Posted 02-26-2020 05:22

    I have a bunch of Unicode records containing various Unicode characters (e.g. € and †) which I want to find and replace using D360. 

    I need to specify the characters by their Unicode code points using the \u format (e.g. \u20AC for the euro sign €).

    If I set up my search patterns in a Create Data node as follows:

    pattern:unicode,replace:unicode
    \u2020,AAAAA
    \u20AC,BBBBB

    then the output pin of the Create Data node renders the characters as expected († and €).

    And my search and replace (using Python's re package) works as expected as well.

    However, if I put the above pattern text into a UTF-8 encoded CSV file and bring in the data via the CSV/Delimited node (with Typed Headers = TRUE), the output pin does not render the characters as above, but instead I see:

    And my Python search and replace does not work (it doesn't find the desired patterns).

    How do I get the CSV input to work? I need to provide a client with an externally configurable list of Unicode patterns, so I can't hard-code into a Create Data node.



  • 2.  RE: Searching with unicode regular expression

    Employee
    Posted 02-26-2020 05:38

    You can just copy the actual Unicode characters into your UTF-8 encoded CSV file and then import them using the CSV/Delimited node.

     

     

    Attached files

    ExtendedChar_Replacements.txt

     



  • 3.  RE: Searching with unicode regular expression

    Posted 02-26-2020 05:41

    Thanks Adrian, but I'm not certain that all the required characters are displayable - but I still need to find and replace them. The ones above are for the purposes of the query only.



  • 4.  RE: Searching with unicode regular expression

    Posted 02-26-2020 06:13

    To be honest, the inconsistent behaviour between the Create Data node and the CSV/Delimited node is itself concerning. Can a bug report be raised for this?



  • 5.  RE: Searching with unicode regular expression

    Employee
    Posted 02-26-2020 06:49

    I will confer with the team to confirm the situation, but I believe the nodes are operating correctly. The CSV/Delimited node expects the data in the file to be encoded using the character set defined by the FileCharacterSet property. The node does not support escaped Unicode characters.

    In the meantime, does this provide you with a solution for escaped Unicode characters?

     



  • 6.  RE: Searching with unicode regular expression

    Posted 02-26-2020 06:52

    Ah, very nice. Was unaware of the 'unicode_escape' option.
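
For readers following along, here is a minimal sketch of how the 'unicode_escape' codec can turn escaped text read from a CSV file into the characters it names. It is not the code from the reply above. The helper names `decode_escapes` and `replace_all`, the sample rows, and the sample text are all hypothetical, and the sketch assumes Python 3 and that each pattern cell contains only ASCII escape sequences.

```python
import re


def decode_escapes(text):
    """Turn literal escapes such as '\\u20AC' into the characters they name.

    Assumption: `text` is pure ASCII (escape sequences only); a cell that
    already contains non-ASCII characters would raise UnicodeEncodeError.
    """
    return text.encode("ascii").decode("unicode_escape")


def replace_all(text, rows):
    """Apply each (pattern, replacement) pair to `text` in order."""
    for pattern, replacement in rows:
        literal = decode_escapes(pattern)
        # re.escape treats the decoded character as a literal, not as regex syntax.
        text = re.sub(re.escape(literal), replacement, text)
    return text


# Hypothetical rows as they might arrive from the CSV file:
# six ordinary characters each, not the character itself.
rows = [(r"\u2020", "AAAAA"), (r"\u20AC", "BBBBB")]

result = replace_all("Price: 10\u20AC \u2020", rows)
print(result)
```

If the pattern column is meant to hold full regular expressions rather than single characters, drop the `re.escape` call; decoding then also interprets other backslash sequences in the cell, so that case needs separate care.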



  • 7.  RE: Searching with unicode regular expression

    Employee
    Posted 02-27-2020 07:09

    The team confirmed the CSV/Delimited node is operating as expected. The node is designed to import data that is encoded per a supported codec (e.g. UTF-8, UTF-16LE, ISO-8859-1, etc.).
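
Put differently, a file that contains the text \u20AC holds six ordinary characters, and a node that only decodes the file's character set will pass them through unchanged. The short Python sketch below illustrates that difference; the variable names and sample text are hypothetical and nothing here is specific to Data360 Analyze.

```python
# What an escaped CSV cell holds: backslash, 'u', and four hex digits.
escaped = r"\u20AC"
# The euro sign itself, as it appears when the file stores the real character.
actual = "\u20AC"

sample = "Price: 10\u20AC"

lengths = (len(escaped), len(actual))
escaped_found = escaped in sample  # plain substring search for the six characters
actual_found = actual in sample    # plain substring search for the euro sign

print(lengths, escaped_found, actual_found)
```

Only the real character matches the data, which is why the escaped text either has to be decoded before use or replaced in the file by the characters themselves.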