Data360 Analyze


Run Data Validation rules in D3S Analyse

Scott PERRY, 07-08-2021 03:30

  • 1.  Run Data Validation rules in D3S Analyse

    Posted 04-28-2021 09:44

    Hi,

    I have a requirement to run some standard DQ validation rules in a generic way on any input file based off of a config file per input file. Please can you advise the best way to do that in D3S Analyse?

     

    Example Use Case below:

    Inputs:

    1. InputData.csv  - File containing data to run data validation rules on. Example data:

    TradeId, LEI, TradeDate, Notional, BuySellFlag

    001, ,22-01-2021, 100, B

    002,ABC,29-04-2021, , X

     

2. TestCasesForInputData.csv

    TestNumber, TestFieldName, TestType, TestTypeValue

    TEST_01,Notional,NotNull, n/a

    TEST_02,LEI,NotNull, n/a

    TEST_03,TradeDate,DateGreaterThan, 28-04-2021

TEST_04,BuySellFlag,CheckInEnum,"B, S"

     

Then, for each row in the input file, the code will process the row and apply all relevant tests from TestCasesForInputData.csv to the relevant fields, e.g. set TestFieldNameValue to the LEI value and run TEST_02, which would set error = Y.

     

If TestType = NotNull
then
    if TestFieldNameValue.isNull() or TestFieldNameValue.strlen() == 0
    then
        error = "Y"
    else
        error = "N"

If TestType = DateGreaterThan
then
    # convert to date first?
    if TestFieldNameValue > TestTypeValue
    then
        error = "Y"
    else
        error = "N"

If TestType = CheckInEnum
then
    if TestFieldNameValue NOT IN TestTypeValue
    then
        error = "Y"
    else
        error = "N"

# for all tests, emit the errors
emit TestNumber, TestFieldName, TestType, TestTypeValue, TestFieldNameValue
where error == "Y"
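The pseudocode above can be sketched in plain Python (the same language the Analyze Transform node scripts use). This is a standalone illustration rather than node code: the field and test names mirror the sample files, the dd-mm-yyyy date format is assumed from the sample data, and DateGreaterThan follows the pseudocode's direction (flagging dates after the threshold).

```python
from datetime import datetime

def run_tests(rows, tests):
    """Apply every configured test to each row; return the failures."""
    failures = []
    for row in rows:
        for test_num, field, test_type, type_value in tests:
            value = (row.get(field) or "").strip()
            if test_type == "NotNull":
                failed = value == ""
            elif test_type == "DateGreaterThan":
                # assumes dd-mm-yyyy strings, as in the sample data;
                # per the pseudocode, dates after the threshold are flagged
                failed = (value != "" and
                          datetime.strptime(value, "%d-%m-%Y")
                          > datetime.strptime(type_value.strip(), "%d-%m-%Y"))
            elif test_type == "CheckInEnum":
                allowed = {v.strip() for v in type_value.split(",")}
                failed = value not in allowed
            else:
                continue  # unknown test types are skipped here
            if failed:
                failures.append((test_num, field, test_type, type_value, value))
    return failures
```

Running this against the two sample rows and four sample tests emits one failure for row 001 (TEST_02, empty LEI) and three for row 002 (TEST_01, TEST_03, TEST_04).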

     

    thanks!



  • 2.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 04-29-2021 09:47

    Here is one way that mirrors your example:

The lower Create Data node represents the imported data from your InputData.csv file and the upper Create Data node is equivalent to the imported data from the TestCasesForInputData.csv file.

A new field is added to both data sets to indicate the type of record ('Test' or 'Data'), and then the records are concatenated, with the 'Test' records at the top of the combined data set.

A Transform node then extracts the details from the 'Test' records and puts them in a list.

The set of tests is then performed on each 'Data' record. If a test fails, the node outputs a record with the details of the failed test and the original input data.
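As a rough standalone sketch of that pattern (not the attached data flow itself; the record and field names here are illustrative, and only the NotNull test is shown): the 'Test' records arrive first and are collected, then each 'Data' record is checked against the collected tests.

```python
def process_stream(records):
    """Tagged records arrive with the 'Test' rows first: collect the test
    definitions, then validate each 'Data' row against them.
    Only the NotNull test is sketched here."""
    tests = []
    failures = []
    for rec in records:
        if rec["RecordType"] == "Test":
            tests.append((rec["TestNumber"], rec["TestFieldName"],
                          rec["TestType"], rec["TestTypeValue"]))
        else:
            for num, field, ttype, tval in tests:
                if ttype == "NotNull" and not (rec.get(field) or "").strip():
                    # failed test: emit the test details plus the input data
                    failures.append({**rec, "TestNumber": num,
                                     "TestType": ttype})
    return failures
```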

    See the attached example data flow (requires Analyze v.3.6.x).

     

     

    Attached files

    DQ_Validation - 29 Apr 2021.lna

     



  • 3.  RE: Run Data Validation rules in D3S Analyse

    Posted 04-30-2021 09:59

Hi Adrian, this is excellent, thank you! Much appreciated.

I have a related query. It seems a very similar use case to the above, but as I am very new to D3S I would appreciate your help again.

Before applying the TestCasesForInputData.csv file I first need to filter my input file using a similar config-file approach, e.g. a config file called FilterConfig.csv containing rules for when to include records from InputData.csv. For example, the rules below would keep the two records from the InputData.csv example file above.

    FilterNumber, FilterFieldName, FilterType, FilterTypeValue

    Filter_01,TradeId,NotNull, n/a

Filter_02,BuySellFlag,CheckInEnum,"B, S, X"

    Filter_03,TradeDate,DateGreaterThan, 20-01-2021

    Please could you provide a similar example D3S dataflow?

    Much appreciated

    Scott



  • 4.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 05-01-2021 01:59

    I'm not sure I entirely understand the details of what you are trying to achieve, but it can probably be implemented using a Join node that matches the required fields in the Filter data with the corresponding fields in the Test data:

    The Join node is configured with the default JoinMode (Inner) and the required MatchKeys:

The node's FieldList property is configured to exclude from the output all fields in the Left data set. In this case the fields were excluded en bloc (click on the green '+' button to set the include/exclude mode), but you could have individually un-checked each field.

In this example the *TypeValue fields were used as matches. However, as the values in your Test data and Filter data do not match, these tests would not have been output by the Join node. In the attached data flow I tweaked the filter data so the tests would be output, but you may decide to only use the *Number and *FieldName fields in your Join MatchKeys criteria. The remainder of the data flow is as before.

Please note that the example is only intended as a proof of concept and there is no validation of the preconditions for the tests being performed, e.g. whether the data type of the selected field is appropriate for the test, etc. The values for the test criteria and the input data (derived from the .csv files) are all assumed to be strings, and the logic that performs the tests makes that assumption.
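In plain Python, the Join-with-left-fields-excluded pattern behaves like a semi-join. A minimal sketch, assuming simple dict records and (unlike the node's MatchKeys pairing) identical key names on both sides; this is illustrative, not the Join node's implementation:

```python
def semi_join(left, right, keys):
    """Keep each right-hand record whose key values match at least one
    left-hand record, dropping all left-hand fields from the output
    (broadly what an Inner join with an exclude-all Left FieldList does)."""
    left_keys = {tuple(rec[k] for k in keys) for rec in left}
    return [rec for rec in right if tuple(rec[k] for k in keys) in left_keys]
```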

     

    Attached files

    DQ_Validation_w_Filter - 1 May 2021.lna

     



  • 5.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-25-2021 08:58

    Hi Adrian

    Thanks for this filter example. 

Going back to the test cases, I am trying to add a "Script" test case where the test config file can contain a script. So in your "Perform DQ Validation Tests" node I added the following code, which works:

elif test[2] == 'Script':
    if eval("""in1[test[1]] is not Null and len(in1[test[1]]) > 0"""):
        pass
    else:
        out1 += in1
        out1.TestNumber = test[0]
        out1.TestFieldName = test[1]
        out1.TestType = test[2]
        out1.TestTypeValue = test[3]
        node.write(0, out1)

but when I replace it with the following code it does not work. There is no error; it just does not evaluate.

elif test[2] == 'Script':
    if eval(test[3]):
        pass
    else:
        out1 += in1
        out1.TestNumber = test[0]
        out1.TestFieldName = test[1]
        out1.TestType = test[2]
        out1.TestTypeValue = test[3]
        node.write(0, out1)

     

test[3] is the TestTypeValue from your test cases input data "TestCasesForInputData" (which I switched to a CSV delimited input file to allow for escaping of double quotes). For some reason it does not evaluate the string from TestTypeValue, i.e. test[3].

TestNumber  TestFieldName  TestType  TestTypeValue
TEST_05     LEI            Script    """in1[test[1]] is not Null and len(in1[test[1]]) > 0"""

    Please can you advise how to get this to work?

    much appreciated!

    Scott



  • 6.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 05-26-2021 05:09

After a lot of head scratching it seems to be a quirk of the eval() function when presented with a string that is itself a triple-quoted string literal - i.e. text surrounded by three sets of double quotes: eval() just returns the (always truthy) inner text instead of evaluating it as an expression. If the string value for the script only has a single set of double quotes in the Create Data node (i.e. there are no quote characters in the string when the value is viewed in the data viewer) then it works!
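A minimal plain-Python illustration of the quirk: when the stored value still contains the triple quotes, eval() sees a string literal and the test can never fail; with the quotes stripped, the expression is actually evaluated.

```python
# The stored TestTypeValue still contained the surrounding triple quotes,
# so eval() saw a string literal, not an expression:
stored = '"""x is not None and len(x) > 0"""'
result = eval(stored)  # just the inner text - a non-empty, truthy string
assert result == "x is not None and len(x) > 0"

# With the quotes removed from the stored value, eval() runs the expression:
x = ""
assert eval("x is not None and len(x) > 0") is False
```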

     

     

    Attached files

    DQ_Validation_w_Script - 26 May 2021.lna

     



  • 7.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-26-2021 06:38

    Thanks Adrian you're a legend! I've been down lots of rabbit holes trying to fix this so I'm glad it turned out to be doable. Thanks again! :-) 



  • 8.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-26-2021 10:11

    Hi Adrian

Apols, one more hopefully small question from a D3S newbie. I am trying to assign the output field name for my field called 'LEI'. out1.LEI works, but how do I pull the field name from my test input data so it's dynamic based on the test being executed? I tried out1.test[1] but that doesn't work. Please can you advise?

out1 += in1
out1.LEI = eval(test[5])      # This works
out1.test[1] = eval(test[5])  # This doesn't work

    Thanks in advance

    Scott



  • 9.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 05-27-2021 02:16

    Unfortunately that will not be possible. You can dynamically name a field using a property value (this could be a property on the node itself, a parent Composite node, a Data Flow property or a Run property), or based on something derived from the node's input metadata. However, you cannot define the name of an output field based on a value derived from something in an input data record. This is because the output metadata is determined when the Transform node's ConfigureFields script is run. This script runs when the node starts execution and this happens before the first record is read - so no values from the input record are available at this time.

Here is an example of using a property value:

    A custom property has been defined on the node ('OutputFieldName'). The ConfigureFields script uses a textual property substitution to obtain the value set on the OutputFieldName property and assigns the value to the 'newFieldName' variable. If no value has been set on the property then the default value is used instead ("_DefaultName"). The out1[<string>] method is used to specify the name of the field and the value of the newFieldName variable will be used as the name of the output field. The newFieldName variable's value is also used in the ProcessRecords script when assigning a value to the required output field.

    The following example shows how an aspect of the input metadata can be used to dynamically set an output field name:

The ConfigureFields script (starting at line 10) checks whether there is a field called 'color' in the input metadata. If it is present, the 'dynamicName' variable is set to the specified value and used to create the new output field. If there is no 'color' field in the input metadata, a different value is assigned to the dynamicName variable and again this is used to create the new output field. The dynamicName variable is still available when the ProcessRecords script runs, so it can be used when assigning a value to the output field when processing records (as described in the previous example).

    Please see here for further information on the Transform node:

    https://support.infogix.com/hc/en-us/articles/360018962334-Using-the-Python-based-Transform-Node

    And the Python scripting > API and script bindings topic in your Help documentation, or the online documentation for the latest version of the product, here:

    https://d3sa-preview.infogixsaas.com/docs/dist/help/Content/f-script/python/python-api.htm

     

You can use the Execute Data Flow node to pass values calculated in one data flow into a 'child' data flow. The values that are passed in to the child data flow are part of the state of the child data flow and are treated in a similar way to other Run properties. See the 'Running with data driven properties' section of the Execute Data Flow node's Help:

    https://d3sa-preview.infogixsaas.com/docs/dist/help/Default.htm#e-node-help/Interfaces_and_Adapters/execute-data-flow.htm



  • 10.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-27-2021 03:04

    Hi Adrian

Thanks for the reply. Thankfully my output field name list will always exist beforehand, so the fields will always be on the input and therefore the output file, and I don't need to create a new field name dynamically (apologies, I should have been clearer on that). I had a play around just now and seem to have a solution using the exec function, as follows. Please let me know if there is a better way of doing it?

out1.LEI = '123ABC'      # This works fine (obviously)

out1.test[1] = '123ABC'  # As we saw, this doesn't work even though test[1] == "LEI"

Using the exec function:

tmp = "out1." + test[1] + " = " + "'123ABC'"

exec(tmp)  # this seems to work fine
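For what it's worth, a plain-Python illustration of why out1.test[1] cannot work, and of the quoting needed when building the exec() string (OutputRecord here is a stand-in, not the node's actual record class):

```python
class OutputRecord:
    """Stand-in for the node's output record object."""
    pass

out1 = OutputRecord()
test = [None, "LEI"]  # test[1] holds the target field name

# out1.test[1] is parsed as (out1.test)[1]: attribute access on the literal
# name 'test', which does not exist on the record. Applying the *value* of
# test[1] as the attribute name needs setattr:
setattr(out1, test[1], "123ABC")
assert out1.LEI == "123ABC"

# The exec() route also works, but the right-hand side of the built string
# must itself be valid Python source - string values need their own quotes:
exec("out1." + test[1] + " = '123ABC'")
assert out1.LEI == "123ABC"
```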

    Kind regards

    Scott



  • 11.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-28-2021 01:18

    Hi Adrian

Just expanding the types of tests I will be able to execute from my test configuration file, to now include an if/else script. I can get the if statement to work directly in an exec statement in the code, but not an if/else statement.

    For example

    This works:

    exec("if in1[test[1]] is not Null and len(in1[test[1]]) > 0:Tmp = 'BUYBUY' ") 

    This doesn't work:

    exec("if in1[test[1]] is not Null and len(in1[test[1]]) > 0:Tmp = 'BUYBUY' else: Tmp = 'SELLSELL' ")

    Please can you help me out with how to do that?

    Thanks in advance

    Scott 



  • 12.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 05-28-2021 02:28

You may want to try restructuring the if else statement as a ternary operator assignment:

    https://www.webucator.com/how-to/how-do-ternary-operator-assignment-python.cfm
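A minimal plain-Python sketch of why the ternary form works where the one-line if/else does not (the names value and Tmp mirror the earlier snippets; the namespace dict is just for this standalone example):

```python
# An inline `else` cannot follow a one-line `if` suite, which is why the
# second exec() string in the question raises a SyntaxError. A conditional
# (ternary) expression keeps the whole statement on one line:
ns = {"value": "B"}
exec("Tmp = 'BUYBUY' if value is not None and len(value) > 0 else 'SELLSELL'", ns)
assert ns["Tmp"] == "BUYBUY"

ns["value"] = ""
exec("Tmp = 'BUYBUY' if value is not None and len(value) > 0 else 'SELLSELL'", ns)
assert ns["Tmp"] == "SELLSELL"
```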

     



  • 13.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-28-2021 02:36

    Sounds good thanks, I'll give it a try. 



  • 14.  RE: Run Data Validation rules in D3S Analyse

    Posted 05-28-2021 08:43

Hey Adrian, just to let you know that the ternary operator assignment worked. Thanks again!



  • 15.  RE: Run Data Validation rules in D3S Analyse

    Posted 06-02-2021 04:22

    Hi Adrian

    Hope you had a nice long weekend!

Another query for you... after the "Perform DQ Validation Tests" node I need to split the output into multiple outputs at runtime, split by TestFieldName. Any suggestions? The split would need to be dynamic at runtime, i.e. not hardcoding the test field names in a split node.

I should also add that once the split is done by TestFieldName, I need to be able to filter each split output using a config file (which I can see is possible by feeding into the Change Metadata node).

    thanks in advance!

    Scott 

PS: I just found this reference from some years ago where it seems a solution was made, but I can't open the example file, perhaps as it's an old .brg file from 6 years ago.

    https://support.infogix.com/hc/en-us/community/posts/360051778573-Dynamic-Outputs-Creating-multiple-dynamic-output-files-from-one-data-source-

     



  • 16.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 06-02-2021 07:56

There is no means to have a truly dynamic split because you need to manually create the output pins of the node. If you know the number of unique tests (i.e. test field name values) then you can create a Transform node with N + 1 outputs and steer records to one of the outputs based on the value of the TestFieldName in the input records:

If the value does not match, the record is steered to the final output and the number of such records counted in an Aggregate node:

If there were unrecognised records this could be used to generate an error:
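The N + 1-output steering can be sketched in plain Python (illustrative only; in the node each list would be a separate output pin):

```python
def steer(records, known_fields):
    """Route each record to a list keyed by its TestFieldName; anything
    unmatched lands on the final 'unrecognised' output."""
    outputs = {name: [] for name in known_fields}
    unrecognised = []
    for rec in records:
        # dict.get falls back to the unrecognised list for unknown names
        outputs.get(rec["TestFieldName"], unrecognised).append(rec)
    return outputs, unrecognised
```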

     

     

     

    Attached files

    DQ_Validation_Results_Split - 2 Jun 2021.lna

     



  • 17.  RE: Run Data Validation rules in D3S Analyse

    Posted 06-02-2021 08:45

Thanks. I also found this previous post:

    https://support.infogix.com/hc/en-us/community/posts/360051778573-Dynamic-Outputs-Creating-multiple-dynamic-output-files-from-one-data-source-

There is a .brg file at the link that has a "Node 2 (Output To Multiple Excel Files By Group)" that sounds like it does what I need, but I can't open the .brg file as I don't have Lavastorm. Was "output to multiple excel files by group" a Lavastorm node that is no longer available?



  • 18.  RE: Run Data Validation rules in D3S Analyse

    Posted 06-02-2021 09:29

    Hi Adrian

Just to let you know I was able to import the .brg file, but first had to open the file in a text editor and change BRGVersion:5 to BRGVersion:6. Once I updated the "RootFileName" path location to my C drive it worked! Just need to see if I can enhance it for my use case now.

    Scott



  • 19.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 06-03-2021 08:25

The .brg file is for the legacy Lavastorm LAE product. As you note, only version 6 .brg files can be imported into Analyze. However, version 5 files are not guaranteed to work when imported, so your mileage may vary.

I'm not sure I can recommend its use unless you are familiar with the Lavastorm product, as the nodes use the legacy scripting language.

    It wasn't clear from your earlier messages, but if your use case is to segment a data set so it can be written to multiple files you can achieve this in a simpler manner - provided the output file format is a csv/delimited file. The Output CSV/Delimited node can be configured to obtain the output file name from an input field by using the 'From field' source option on the Filename property:

    See the attached example.
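A plain-Python sketch of the 'From field' behaviour, assuming an illustrative OutFile field holds the target file name. A real run writes to disk rather than in-memory buffers, and every file receives the same set of fields:

```python
import csv
import io

def split_to_csv(records, filename_field, fieldnames):
    """Write each record to a CSV buffer chosen by the value of
    filename_field (mirrors the Output CSV/Delimited node's
    'From field' Filename option)."""
    buffers = {}
    writers = {}
    for rec in records:
        name = rec[filename_field]
        if name not in writers:
            buffers[name] = io.StringIO()
            writers[name] = csv.DictWriter(buffers[name], fieldnames=fieldnames)
            writers[name].writeheader()
        writers[name].writerow({k: rec[k] for k in fieldnames})
    return {name: buf.getvalue() for name, buf in buffers.items()}
```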

     

    Attached files

    Split_Data_Into_Multiple_CSV_Files - 3 Jun 2021.lna

     



  • 20.  RE: Run Data Validation rules in D3S Analyse

    Posted 07-08-2021 02:10

    Hi Adrian

Just exploring the suggestion you made in the graph above for splitting data into multiple CSV files. I can see this works well in your example, but I have an additional requirement to also specify particular fields in each output file, which I am able to do in the do-while loop of the other graph using a Set Metadata node before creating the output file.

Is there any way to use the filename (from field) approach but also drive the output fields per file based on data in the input file?

    thanks

    Scott 

     



  • 21.  RE: Run Data Validation rules in D3S Analyse

    Employee
    Posted 07-08-2021 03:28

    Hi Scott, 

No, you cannot subset the fields that are output in the csv files on a per-file basis; all of the fields will be output to each of the csv files.



  • 22.  RE: Run Data Validation rules in D3S Analyse

    Posted 07-08-2021 03:30

    okay thanks