Data Points

  • 1.  Working with large files (SQL Server & EmEditor)

    Posted 05-03-2019 16:10
    I'm currently working on a project where I need to work with and query large datasets. My computer can't handle processing really big files so I've been struggling a bit. I finally found a workflow that's getting the job done that I wanted to share incase anyone else is experiencing similar issues.

    What I'm trying to do is isolate specific property records associated with certain people or businesses. I have the list of people/businesses who own the land broken out by state, which helps ease the file-size issue a bit, but some of the larger states were still giving me trouble. Here's what I've been doing to accomplish what I need without access to a server.

    1) I had the property attribute data for New Hampshire downloaded to my computer, divided into separate .txt files for each county. I wanted to query the whole state, so first I needed to combine the .txt files. To do this I downloaded EmEditor, added all of the NH county files to it, and used the combine tool to create a single state file. Then I defined the delimiter as pipe and identified which row I wanted as the header.

    2) Once I had the property attributes combined into one state file for NH, I opened SQL Server Management Studio. Using the script provided with my download of the property attribute data, I created a table for NH that I could load the data into.
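    The real column list comes from that provided script, but as a rough sketch it's a CREATE TABLE statement along these lines (the column names below are just placeholders):

        -- Simplified example; the provided script defines the actual column list
        CREATE TABLE dbo.NH_PropertyAttributes (
            ParcelID      VARCHAR(50),
            OwnerName     VARCHAR(255),
            SitusAddress  VARCHAR(255),
            County        VARCHAR(100),
            AssessedValue DECIMAL(18, 2)
        );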
     
    3) To import the data I used the following script (with the exception of part of the path name, which I had to blur out):
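    For illustration, a pipe-delimited file can be loaded with a BULK INSERT along these lines (the table name and path below are placeholders, not my actual ones):

        -- Placeholder table name and path; FIRSTROW = 2 skips the header row
        BULK INSERT dbo.NH_PropertyAttributes
        FROM 'C:\data\NH\NH_statewide.txt'
        WITH (
            FIELDTERMINATOR = '|',
            ROWTERMINATOR = '\n',
            FIRSTROW = 2
        );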

    4) Then I was ready to query the data and find the property records I needed: 
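    For example, a query along these lines pulls the records for a given set of owners (the names here are made up; the real list comes from my people/business file):

        -- Made-up owner names for illustration only
        SELECT *
        FROM dbo.NH_PropertyAttributes
        WHERE OwnerName IN ('ACME HOLDINGS LLC', 'SMITH, JOHN');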

    So far I'm having success with this method. Has anyone else ever had to work with large file sizes in an environment that can't typically handle them? If so, I'm interested to hear what other ways I can work on this.

    ------------------------------
    Briana Brown
    Data Product Marketing Manager
    New York, NY
    ------------------------------


  • 2.  RE: Working with large files (SQL Server & EmEditor)

    Employee
    Posted 05-04-2019 05:09
    Query and Analysis using Open-Source Software (PostgreSQL)
    • Install PostgreSQL for your operating system (Windows or macOS)
    • Copy the data to a local drive
    • Execute a CREATE TABLE statement to create the table into which the files will be loaded (specify table name, field names, data types and lengths, indexing info, primary key, etc.); a sketch is shown after this list
    • Create a batch file in the same folder where the files are present
      Example of batch file content to load multiple *.txt files:
    for %%f in (*.txt) do psql -d your_database_name -h localhost -U postgres -p 5432 -c "\COPY public.tablename FROM '%%~dpnxf' DELIMITER '|' CSV"
    pause

    • Run the batch file from the command prompt to load the data into the created table
    • Execute standard SQL queries to extract the required information from the loaded table
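    As an illustration, the CREATE TABLE statement and a follow-up query might look something like this (the column names are placeholders and should be matched to the layout of your files):

        -- Placeholder columns; match these to the layout of your *.txt files
        CREATE TABLE public.tablename (
            parcel_id      VARCHAR(50),
            owner_name     VARCHAR(255),
            county         VARCHAR(100),
            assessed_value NUMERIC(18, 2)
        );

        -- Once the batch file has loaded the data, query it with standard SQL
        SELECT owner_name, COUNT(*) AS parcels
        FROM public.tablename
        GROUP BY owner_name;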


    ------------------------------
    Dr Neena Priyanka
    Pitney Bowes Software India PVT. Ltd
    Noida
    ------------------------------



  • 3.  RE: Working with large files (SQL Server & EmEditor)

    Employee
    Posted 05-04-2019 05:21
    Amazon provides a few services to query and analyze large files if you are not familiar with big data technologies.
    Query and Analysis using Amazon Athena
    • Set up an Amazon account, or use the enterprise account set up by your organisation
    • Copy data to an S3 bucket (ensure the extracted file is kept as *.txt, *.csv, or *.tsv if the compression is anything other than GZIP or BZIP2)
    • Connect Athena to the data stored in Amazon Simple Storage Service (S3), identify the fields, run queries, and get results in seconds; a small sketch follows the note and link below
    Note: Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using Standard SQL Queries. You don't have to build, manage, or tune a cluster or any other infrastructure, and you pay only for the queries that you run.
    Weblink: https://aws.amazon.com/blogs/aws/amazon-athena-interactive-sql-queries-for-data-in-amazon-s3/
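    As an illustration, a pipe-delimited file in S3 can be exposed to Athena with an external table like the one below (the bucket, prefix, and column names are placeholders):

        -- Placeholder bucket, prefix, and columns
        CREATE EXTERNAL TABLE property_attributes (
            parcel_id      string,
            owner_name     string,
            county         string,
            assessed_value double
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'
        LOCATION 's3://your-bucket/property-data/'
        TBLPROPERTIES ('skip.header.line.count' = '1');

        -- Query the files in place with standard SQL
        SELECT owner_name, COUNT(*) AS parcels
        FROM property_attributes
        GROUP BY owner_name;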

    Query and Analysis using Redshift
    • Set up an Amazon account, or use the enterprise account set up by your organisation
    • Copy data to an S3 bucket (ensure the extracted file is kept as *.txt, *.csv, or *.tsv if the compression is anything other than GZIP or BZIP2)
    • Launch an Amazon Redshift cluster.
    • Install a SQL client
    • Connect your SQL client to the cluster master database.
    • Load data from S3 into the Redshift cluster and start analyzing it using standard SQL queries; a COPY sketch follows the note and link below
    Note: Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.
    Weblink: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading_data.html
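    As an illustration, once the target table exists in Redshift, the load from S3 is a single COPY command along these lines (the table, bucket, and IAM role are placeholders):

        -- Placeholder table, bucket, and IAM role
        COPY public.property_attributes
        FROM 's3://your-bucket/property-data/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
        DELIMITER '|'
        IGNOREHEADER 1;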


    ------------------------------
    Dr Neena Priyanka
    Pitney Bowes Software India PVT. Ltd
    Noida
    ------------------------------



  • 4.  RE: Working with large files (SQL Server & EmEditor)

    Employee
    Posted 05-06-2019 08:44
    Hi Briana,

    Over the years I have found that the easiest way (for me) to combine multiple text files into a single file is by using the command line in a DOS window. The "type" command writes out all records within a file to your screen, and you can easily redirect that output to a file that you specify instead of to the screen.

    So in your case, if each of your county files starts with "NH", you could use the command "type NH*.txt > output.txt". This essentially writes all of the records from each NH*.txt file into output.txt. The only trick is handling the header rows in each file: you can either remove them all first and reinsert a single header into your output file, or find them after the fact and remove them from the output file.

    ------------------------------
    Colleen Reed
    PITNEY BOWES SOFTWARE, INC
    Maitland FL
    ------------------------------