Data360 DQ+

  • 1.  HDFS Input Data Stores: Regular Expression for File Path?

    Posted 03-11-2020 12:51

    I'm trying to read an HDFS file (really a directory, as is typical) that holds data in a series of files named bucket_NNNNN, but also contains another file, _orc_acid_version, that does not contain data and should be excluded. Here's a snapshot from the Ambari file browser:

     

    I believe I should be using the "Regular Expression for File Path" feature of the HDFS data store definition screen, but nothing I try works for me. I find the online documentation a bit vague on this feature. My questions:

    • Should the "Path:" parameter be left blank when providing a regular expression in the "Regular Expression for File Path:" parameter? Or should the "Path:" parameter contain the fixed portion of the path and the regular expression parameter specify the variable part?
    • In my case, the directory path is of the form /some_directory/delta_0000001_0000001_0003/<bucket files>. I expected a regular expression like "/some_directory/[_0-9]*/bucket_[0-9][0-9]*" to work (the test feature says it's correct). Are there limitations on the format of the regular expression I must follow?
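
    One way to sanity-check the expression outside the product is a minimal sketch like the following. It uses Python's re module with whole-string matching, which is an assumption: the thread does not say which regex engine or matching mode the data store uses. The two sample paths are illustrative, built from the directory layout described above, and WITH_DELTA is a hypothetical variant, not a confirmed fix. Under these assumptions the quoted expression does not match, because [_0-9]* cannot consume the letters in "delta".

```python
import re

# Pattern quoted in the post (unchanged).
ORIGINAL = r"/some_directory/[_0-9]*/bucket_[0-9][0-9]*"
# Hypothetical variant that also spells out the literal "delta" prefix.
WITH_DELTA = r"/some_directory/delta[_0-9]*/bucket_[0-9][0-9]*"

# Illustrative paths assembled from the directory layout in the post.
DATA_FILE = "/some_directory/delta_0000001_0000001_0003/bucket_00000"
MARKER_FILE = "/some_directory/delta_0000001_0000001_0003/_orc_acid_version"


def matches(pattern: str, path: str) -> bool:
    """Return True when the pattern matches the whole path (assumed semantics)."""
    return re.fullmatch(pattern, path) is not None


if __name__ == "__main__":
    for pattern in (ORIGINAL, WITH_DELTA):
        for path in (DATA_FILE, MARKER_FILE):
            print(f"{pattern!r} vs {path!r}: {matches(pattern, path)}")
```

    If the product matches only the part of the path after "Path:", the strings to test would differ, so this is a check of the expression itself rather than of the data store's behavior.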

    I'm stuck, so any help would be appreciated.

    Thanks,

    Rob



  • 2.  RE: HDFS Input Data Stores: Regular Expression for File Path?

    Posted 03-12-2020 13:18

    Hi Rob,

    Can you try this expression for File Path: .bucket_.*



  • 3.  RE: HDFS Input Data Stores: Regular Expression for File Path?

    Posted 03-12-2020 14:13

    Brenda, the way you've worded it, you seem to suggest I try this:

    But that doesn't work ("Incomplete HDFS URI"). But I think you're suggesting to try this:

    That doesn't throw an error like the first attempt, but it doesn't retrieve any data from the directory, even though the editor's regex parser confirms the expression should match "/bucket_00000", which is a non-empty file in the directory. Are the "Path:" and "Regular Expression for File Path:" values concatenated to determine access? Or should "Path:" be empty when the other parameter is used?

     



  • 4.  RE: HDFS Input Data Stores: Regular Expression for File Path?

    Posted 03-13-2020 07:54

    Hi Rob,

    Your second approach, with the regular expression ".*bucket_.*", seems to be correct. If the file is a text file, the filter extension should be specified as .txt.

    Regarding your question about whether the path should be empty: no, it should not be empty. It can contain the full path, or even just the root directory (in this case /tmp/some_directory) if it is unique.
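
    To illustrate why the leading ".*" may matter, here is a small sketch comparing the two expressions from this thread. It assumes Python's re module and whole-string matching against the full file path; the thread does not confirm that the data store behaves this way. The long paths are illustrative, combining the root directory mentioned above with the directory layout from the first post.

```python
import re

# The two expressions discussed in the thread.
CANDIDATES = [r".bucket_.*", r".*bucket_.*"]

# Illustrative inputs: the short name tested in the editor, plus full paths
# built from /tmp/some_directory and the delta directory in the first post.
SHORT_NAME = "/bucket_00000"
FULL_DATA = "/tmp/some_directory/delta_0000001_0000001_0003/bucket_00000"
FULL_MARKER = "/tmp/some_directory/delta_0000001_0000001_0003/_orc_acid_version"


def selected(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths that the pattern matches in full (assumed semantics)."""
    return [p for p in paths if re.fullmatch(pattern, p)]


if __name__ == "__main__":
    for pattern in CANDIDATES:
        print(pattern, "->", selected(pattern, [SHORT_NAME, FULL_DATA, FULL_MARKER]))
```

    Under these assumptions, ".bucket_.*" matches only the short name, while ".*bucket_.*" also matches the full path of the bucket file; neither selects _orc_acid_version.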