Data360 Analyze

 View Only

Split or Remove Rows with Missing Data

  • 1.  Split or Remove Rows with Missing Data

    Employee
    Posted 04-03-2019 04:37

    A Data3Sixty Analyze user recently needed to split out rows from a data set that had missing values in a particular field. In this case the field was a string so the requirement was to identify the rows that contained Null values or where the value appeared blank (an empty string or only whitespace), and the Split node or Transform node could be used to split the data.

    In some situations you may want to split out rows where any field, or a subset of the fields, had missing values. The previous methods can still be used but the nodes would need to be configured with each of the fields that needed to be checked.

    Attached is a community custom node that enables you to scan all fields in the data for missing values. Rows with no missing values (aka complete cases) are output on the node's first output pin and the rows with missing values are output on the output second pin. By default the node will scan all fields but you can also specify a comma-separated list of fields to be checked, or a comma-separated list of fields to be excluded from the checks (you can't specify fields to be included and fields to be excluded). In addition to identifying missing values in string/unicode data type fields, the node will identify missing (Null) values in fields with other data types (integer, dates, Boolean, etc). Field names can by entered manually or can be inserted using the Input Fields reference menu (though you will have to manually enter the commas); the apostrophes are only necessary when the field name contains space characters.

    The node in the attached data flow is a standard Composite node but, if required, you can convert it into a library node (select node -> Right-click -> Convert to library node). Note, you will need to be running Data3Sixty Analyze 3.4.0 to import the data flow.

    I hope you find it useful.

     

    P.S.  The attached node does not identify the string "NA" as a missing value. If you are dealing with data that was exported from an R statistical analysis environment then you may want to modify the definition of 'missing' to include the "NA" string. To do so select the Transform node within the Composite and find the two lines in the ConfigureFields script that contain

      if fields[_fieldName] is Null or fields[_fieldName].strip() == "":

    and change them to be:

      if fields[_fieldName] is Null or fields[_fieldName].strip() == "" or fields[_fieldName] == "NA":

     

    Attached files

    Complete_Cases_Node - 3 Apr 2019.lna