We all know that one of the biggest tasks is cleaning our data and getting it ready for analysis. Once you’ve spent all that time cleaning your data I will argue you should store it in a relational database. While you may not do all your analysis there – my upcoming workshop at ODSC West 2019 may surprise you in how much you can actually get done with PostgreSQL.
Hear me out as I explain why a relational database actually helps you clean and tidy your data.
- First of all, most of our data is relational and the database actually helps enforce those relations. Having your data in a CSV file or in a document store does not prevent the accidental deletion or altering of relations.
- Many relational databases also provide transactions, which prevent the entry of half-baked data by rolling back data entry if something goes wrong half-way through submitting the data.
- Relational data stores also have the ability to enforce, to varying degrees, storing the right type (numeric, string, date…) of data with the right variables. This enforcement can prevent errors on data entry and on mass importing of data, providing at least a first line check on data cleanliness.
- Relational data stores allow you to perform operations on the data right where it is stored. There is no need to pull all the data across the wire, into some local data structure, and then perform calculations on it there. Do the operations right there on the data and then pull the answers down.
- Finally, the PostgreSQL developers have spent years getting the query optimizer to efficiently carry out your queries so you can just focus on getting the data you want.
[Related article: Switching Between MySQL, PostgreSQL, and SQLite]
You probably knew most of this about RDBMS (PostgreSQL in particular). What I hope to teach you in my session at OSDC is all the interesting things PostgreSQL has to offer to make your big data cleaning and analysis easier. Let me give you the outline (and some links) on what we are going to cover in the session.
Spatial Data – PostgreSQL has the gold standard in spatial data extensions – PostGIS. If you have data that has direct spatial information, like coordinates, or indirect, such as an address, you can leverage the power of spatial analysis to enhance the insights into your data. In the workshop, we show some of the simple queries like closest distance but also buffering and selecting.
[Related article: Getting the Most Out of Spatial Data]
Full-Text Search – This one will probably come as a surprise to most developers but PostgreSQL has advanced full text search – like Lucene (the “engine” under ElasticSearch and SOLR). You have probably used the LIKE or ILIKE operators or maybe even fuzzy string matching, but full text capabilities into Postgresql deal with “documents” and building document indices with specialized dictionaries.
Key-Value Columns – If you are interested in having more loosely defined key-value data stored with your relational data you can use the PostgreSQL hstore extension. Rather than having to have a defined schema with pre-defined columns and types for all your data, you can use an hstore column type to store any number of key-value pairs. Operators in this extension allows you to query the data in multiple ways. It’s like a Python dict or Java map in the database.
JSON Document Storage – With the rise of modern web applications we saw the rise of JSON as a data format. It turns out that the key-value data store functionality is actually just a subset of what we can do with a JSON type. Not only do we get all the ability to store, query, and perform calculations on JSON documents, we can rely on the relational power of the database where needed. It’s the best of both worlds.
Embedding R in PostgreSQL – Most of you have probably used R for some of your data analysis. Well, now you can do your R development using whatever tool you normally use and then expose that as a function to database users. This way you can work to create the proper statistical functions but then give access to users in an easy and safe manner as a PostgreSQL function. Again this is a great way to do your analysis close to where the data resides and then pull back just the results you need. As a side note, if you use Python for your data analysis you can embed that as well.
I hope you can see that PostgreSQL should be a valuable tool in your big data/data science toolbox. Coming to OSDC West in San Francisco and signing up for my workshop, “All The Cool Things You Can Do With Postgresql to Next Level Your Data Analysis,” should set you up to go home and really dig in with PostgreSQL and your data. I will be sharing with you some of the great insights from my Crunchy Colleagues and the whole PostgreSQL community. Hope to see you then!