
Dealing with UTF-8 in BRAINscript

  • 1.  Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-11-2013 10:15

    Note: This was originally posted by an inactive account. Content was preserved by moving under an admin account.

    Originally posted by: pdespot

    Hello, everyone.

    The topic of manipulating UTF-8 strings within the nodes has come up with increasing frequency recently. To that end, I’d like to share a few tips with the community.
    1. Before you can manipulate UTF-8 data, you need to make sure those fields are stored as the unicode type, not the string type. You can check this by opening the BRD viewer and looking at the field in question; the field's data type is shown directly under the field name. If it's of type "string", it may display properly in the BRD viewer, but it will not work when you attempt to operate on the string.
    2. UTF-8 encoded field names are not allowed and will cause a node to fail.

    Let's say you have completed steps #1 and #2 above and now want to add a Filter node that excludes certain records based on the contents of a column. Normally, you would just use BRAINscript such as:
    emit * where columnA <> "my_string"
    If that string happens to be UTF-8 encoded, you'll have a problem, because you can't enter it natively in the script. For example:
    emit * where columnA <> "只依三"
    will not work, because UTF-8 isn't supported in the script at the moment.

    The solution is to use the Unicode Converter tool in BRE. To do so:
    1. Open BRE
    2. Go to the View menu and select Unicode Converter
    3. Paste the UTF-8 string you want to use in the script (in our example, "只依三") and click Convert
    4. You'll get output that looks something like "\u53ea\u4f9d\u4e09" – those are the hexadecimal Unicode code points of the characters you entered
    5. Go back to your filter node and paste that string wherever you would use a normal string (e.g. in functions, variables, etc.)
    emit * where columnA <> u"\u53ea\u4f9d\u4e09"
    Note the "u" before the string. It is needed to indicate that what follows should be treated as Unicode.
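    For anyone curious what the Unicode Converter is doing under the hood, here is a minimal Python sketch (Python is used only for illustration; this is not BRAINscript and not the tool's actual code) that maps each character to its \uXXXX code-point escape, matching the output shown in step 4:

```python
def to_unicode_escapes(text):
    """Map each character to its \\uXXXX escape sequence,
    mimicking what BRE's Unicode Converter produces."""
    return "".join("\\u%04x" % ord(ch) for ch in text)

print(to_unicode_escapes("只依三"))  # \u53ea\u4f9d\u4e09
```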




  • 2.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 06:53


    Originally posted by: aop

    Hi!

    How do I correctly show UTF-8 data in the BRD viewer after unicode conversion?

    How do I get the Output Excel node to work with UTF-8 data?

    Thanks in advance!


  • 3.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 07:04


    Originally posted by: pdespot

    Hello, aop.

    The BRD viewer has a separate configuration for which character set to use. To change it:
    1) Open the BRD viewer
    2) Go to the Preference menu and select "Set codepages"
    3) That will bring up a dialog box in which you can select your character set
    4) Enter "utf-8" without the quotes and click OK
    5) That should correctly display the UTF-8 characters


  • 4.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 07:09


    Originally posted by: aop

    It does work for the original UTF-8 data, but after applying the unicode function to the field, the data no longer displays correctly with that codepage.


  • 5.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 07:13


    Originally posted by: pdespot

    Interesting. Could you please post the graph with the node making that change and some sample data?


  • 6.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 08:21


    Originally posted by: aop

    I attached a zip below with graph and demo data.
    Attachments:
    UTF8.zip


  • 7.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-14-2013 11:19


    Originally posted by: pdespot

    When I run the attached graph, it completes and produces the attached output. Note, this is the output pin of the Filter node. Also, both Excel files contain the same data. Is that what you're seeing?
    brgout.jpg


  • 8.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-16-2013 06:01


    Originally posted by: Tim Meagher

    Hey,

    The problem here is that the Delimited File node does not handle unicode data.
    It always tries to read the data in the LAE's character set.
    The LAE has a configurable character set, which is used for the string data fields and metadata.
    However, this is required to be a single-byte character set (by default this is windows-1252).

    The Delimited File node doesn't actually do any character set manipulation or character encoding/decoding.
    Therefore, the data that is on the output of the node is actually UTF-8 data, which is why if you change the display preferences in the BRD Viewer, it displays correctly.
    However, since this is in a string field, none of the other nodes are able to handle this data as they try to decode it in the LAE's character set.
    When you try to convert this to unicode, you will run into errors for this reason.
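    The mismatch Tim describes can be reproduced in a couple of lines of Python (illustrative only; this is not LAE code): raw UTF-8 bytes decoded with a single-byte character set such as windows-1252 come out as unrelated garbage characters, which is exactly the "displays wrong until you switch the viewer codepage" symptom.

```python
# Cyrillic "д" encoded as UTF-8 produces two bytes...
raw = "д".encode("utf-8")          # b'\xd0\xb4'
# ...which a windows-1252 decode (the LAE default single-byte
# character set) misreads as two unrelated Latin-1-style characters.
print(raw.decode("windows-1252"))  # Ð´
```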

    These limitations with both the Delimited File node and the LAE's character set are logged in our bug tracking system and will be addressed in a future release. Fixing them, however, requires a large development effort to overhaul all of the node infrastructure that expects string fields to be encoded in a single-byte character set, along with the various pieces of the software that process output and input metadata.


    In the meantime, I have put together a node which I think should help you address the problem of acquiring delimited unicode data in the LAE.
    I have posted the node in the thread: http://community.lavastorm.com/threa...g-unicode-data

    Once the data is in a correctly encoded unicode data type, it should work correctly with the Output Excel node.

    Note that this is not an officially released or supported Lavastorm node, so support will only be provided through the forums until this node, or something similar, can be integrated into the core product.

    Hope this helps,

    Tim.


  • 9.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-20-2013 01:55


    Originally posted by: aop

    Thanks for the detailed explanation!

    Just to make sure I understood correctly: if I have a data set containing Cyrillic or Chinese characters (and I can encode/convert them to any encoding, UTF-8, UTF-16, etc., during extraction before bringing them into BRE, and I can input them in whatever way is available in BRE, as text, through a DB query, etc.), there is currently no way to connect this input to an Excel output and successfully run it through?

    My current situation is that bringing UTF-8 data into the Excel output fails, and wrapping the fields in the unicode function beforehand lets the Excel output run, but the characters don't display normally.


  • 10.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-20-2013 02:20


    Originally posted by: Tim Meagher

    Hi,

    If this is unicode data and there is no single-byte character set that can encode this data, then it will need to be read into the LAE in a unicode type field.
    For instance, if you are only reading cyrillic data, then you could set your server and BRE character set to be windows-1251, and convert the data from UTF-?? to windows-1251 outside of the LAE, then parse this data.
    However, if you need to read cyrillic data and chinese data, then there is no single character set outside of the unicode character sets that can encode these, and there is certainly no single byte character set that can do so.
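    A quick Python check (again, purely illustrative, not LAE code) confirms the point: Cyrillic text fits in the single-byte windows-1251 code page, but Chinese text does not, so mixed-script data has to live in a Unicode field.

```python
# Cyrillic fits in the single-byte windows-1251 code page:
# each character maps to exactly one byte.
cyrillic = "привет".encode("windows-1251")
print(len(cyrillic))  # 6 bytes for 6 characters

# Chinese characters have no windows-1251 representation at all.
try:
    "只依三".encode("windows-1251")
except UnicodeEncodeError:
    print("no single-byte code page covers both scripts")
```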

    Therefore, they need to be acquired into the LAE in a unicode field.
    The post I mentioned previously, here: http://community.lavastorm.com/threa...g-unicode-data contains a node which allows you to acquire such unicode delimited data into the LAE.
    I just updated this, because I forgot to remove some base libraries that weren't necessary, and I had uploaded the node with some debug flags on, which means that it would have appeared to hang.
    If you try that now, you should be able to use it to get the unicode data into the LAE, and then using the Output Excel node, you should get the correct output spreadsheet.

    Regards,
    Tim.


  • 11.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 08-22-2013 03:19


    Originally posted by: Tim Meagher

    Just checking - did this work for you?


  • 12.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 10-11-2013 04:09


    Originally posted by: vkumar

    Hey Tim,

    I was reading some unicode data I have in Chinese. I am augmenting some other data and need to export one field with Chinese characters and the rest in normal ASCII, but the output needs to be a "|" delimited txt file. Do you think that would be possible? I tried, but the Chinese data comes out as \uXXXX.
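    For reference, the target file vkumar describes is just a UTF-8 text file with "|" separators. A hypothetical Python sketch of the desired end result (the file name and sample rows are made up for illustration; this is what the file should contain, not how LAE produces it):

```python
# Write mixed ASCII/Chinese rows to a pipe-delimited, UTF-8 text file.
rows = [("A001", "香港商"), ("A002", "plain ascii")]  # hypothetical sample data
with open("out.txt", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("|".join(row) + "\n")

# Reading it back yields the original characters, not \uXXXX escapes.
print(open("out.txt", encoding="utf-8").read())
```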

    thanks,


  • 13.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 01-27-2015 08:00


    Originally posted by: the1don

    Hi

    I am having issues with writing UTF-8 data to Excel similar to what is discussed in this thread. When I try to click on the link Tim provides, I get a message that I do not have permission to access the page. Can I get this node, or any information on how to export non-ASCII data to Excel?

    Thank you
    Don


  • 14.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 01-27-2015 08:36


    Originally posted by: Tim Meagher

    Hi,

    I believe you need to request access to the Lavastorm Labs section of the forum via the Group Membership section of the Control Panel.

    Tim.


  • 15.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 07-25-2018 09:29


    Originally posted by: Arunn

    Hi,

    I need your help handling UTF-8 data that comes from the DB in LAE 6.1.4. I used the DB node to fetch records that have 香港商 data in one of the columns, and I can see the correct values in the data viewer. But when I try to write the records to a CSV file (pipe delimiter), it doesn't show the correct values; instead, junk values (\u5409\u4f51\u6709\) are written to the CSV file.

    Please help me on this.

    Thanks,
    Ar


  • 16.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 07-25-2018 10:15


    Originally posted by: gmullin

    This is a custom node (not supported, not part of the package), but it works well for writing UTF-8 values out to a delimited file. Try replacing the current Delimited output node you have with this one.

    UTF8-DelimitedOutput.brg


  • 17.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 07-26-2018 09:12


    Originally posted by: Arunn

    Originally posted by: gmullin
    					

    This is a custom node (not supported, not part of the package) but works well for writing out UTF8 values to a delimited file. Try replacing the current Delimited output node you have with this one.

    UTF8-DelimitedOutput.brg
    Thank you very much, Mullin! It's working as expected.


  • 18.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-11-2019 11:28


    Originally posted by: Arunn

    Hi,

    I'm trying to read an Excel file that has Chinese characters in a column name, and because of this I can't read the file.

    ERROR: Input length = 1

    ERROR: Error occurred while attempting to open output (0) : "out1 cause: Error occurred writing metadata for field (1), name: ????, type: class com.lavastorm.lang.UnicodeString.
    "
    Error Code: brain.node.errorOpeningOutput


  • 19.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-12-2019 06:49


    Originally posted by: Arunn

    Can you please help me to read the data from the excel where it contains the column name as chinese character ?


  • 20.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-12-2019 07:20


    Originally posted by: gmullin

    Do you have the DefaultType (Optional tab) set to Unicode?


  • 21.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-12-2019 07:47


    Originally posted by: Arunn

    I tried with Unicode as well, but it's still not working.


  • 22.  RE: Dealing with UTF-8 in BRAINscript

    Employee
    Posted 06-12-2019 10:53


    Originally posted by: gmullin

    At first I thought it was just data in a field, not the field name itself. I don't think you're going to be able to read this without manipulating the field names.

    Here's an idea: you might be able to use headerRow="2" to skip the header and then add some logic to rename the fields back using inputFields(1). Have a look at this example, which reads a simple Excel file and bypasses the header row.

    node:Excel_File
    bretype:core::Excel File
    editor:sortkey=5d012dbf08020e2a
    output:@48cf6cd2305e5f95/=
    prop:File={{^FileName^}}
    prop:UseSystemLocale=false
    prop:WorkbookSpec=<<EOX
    <workbook>
    <sheet index="1" headerRow="2" dataStartRow="2" />
    </workbook>
    EOX
    editor:XY=350,380
    end:Excel_File
    
    node:Filter_3
    bretype:core::Filter
    editor:sortkey=5d013ade07242fe9
    input:@40fd2c74167f1ca2/=Excel_File.48cf6cd2305e5f95
    output:@40fd2c7420761db6/=
    prop:Script=<<EOX
    _inputFields = inputFields(1)
    
    emit _inputFields[0] as "Field1"
    
    EOX
    editor:XY=450,380
    end:Filter_3