Monday, September 29, 2014

Bitnami Open Source Leaders Interview Series: Dave Page from PostgreSQL

Billing itself as the "world's most advanced open source database", PostgreSQL is bundled with many of our most popular Bitnami apps, including our LAPP stack, Dev Pack, OpenERP/Odoo, Discourse and more. To kick off our new Open Source Leaders podcast series, we interviewed Dave Page, the Director of PostgreSQL Europe, to learn more about the PostgreSQL database and what's to come from the project.

Here are just a few of the topics we covered:

How does the PostgreSQL community work?
Why should one be involved in the community?
What is next for PostgreSQL?  
How does PostgreSQL stack up next to the NoSQL leaders?
Does PostgreSQL work on the cloud?
What level of experience do you need to run PostgreSQL?

You can launch a PostgreSQL application or stack to the cloud with Bitnami for free, or download any of our free native installers or VMs to run the software locally. Get started in the cloud for free with a $200 credit from Microsoft Azure. You can browse a full list of Bitnami app stacks that contain PostgreSQL on the Bitnami site.



Stuart Langridge:        This is the Bitnami Open Source Leaders series of interviews.  I’m Stuart Langridge and I’m here talking to Dave Page.  Dave is Director of PostgreSQL Europe, Vice-Chair of the PostgreSQL Community Association of Canada, Chief Architect of Tools and Installers at EnterpriseDB, and a core team member of the PostgreSQL project.  Hey Dave, welcome to the interview.

David Page:                Thank you, Stuart, nice to talk to you.

Stuart Langridge:        So what is Postgres? 

David Page:                Postgres is an open source relational database management system. We compete mostly with the likes of Oracle and SQL Server.  It’s completely open source, under a permissive license.  We have a very large community of people who are both using Postgres and submitting patches on a regular basis.  So we’re pretty diverse in the things that Postgres supports, because we have this large community of people from all sorts of different areas and different industries helping us build the product and add support for all sorts of useful features.

Stuart Langridge:        You talked there about the size and diversity of the community.  Is Postgres targeting a particular use case or does it do everything from small data storage up to huge data stores?


David Page:                Absolutely.  We’ve got people running everything from a 10-megabyte database up to 10-terabyte databases.  There are people who are using it for data warehousing, for OLTP, and for storing their unstructured data.  It’s pretty versatile and it’s always been aimed at being a general database that’s useful to everyone. 

Stuart Langridge:        Are there particular areas that you’re interested in targeting at Postgres where you haven’t got there yet, where you’re working on getting into those different kinds of environments and uses?

David Page:                The big area for us at the moment is unstructured data.  People have obviously moved very much towards some of the NoSQL databases for that kind of workload.  Postgres is actually pretty well suited to it as well.  We have fantastic support for JSON, which is being enhanced with a new jsonb data type that came with our 9.4 release, and which has proven to significantly outperform the NoSQL market leader in all the benchmarks I’ve seen.  It gives users the opportunity to take those unstructured workloads, move them into their relational database and really amalgamate all their data into one location, rather than having to run multiple technologies at once. 
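
[Editor's note: for readers who want to try this, here is a minimal sketch of storing documents in a jsonb column on PostgreSQL 9.4 or later. The table and column names are made up for illustration.]

    -- A table with a jsonb column for unstructured documents
    CREATE TABLE events (
        id  serial PRIMARY KEY,
        doc jsonb NOT NULL
    );

    -- Documents are parsed, validated and stored in a decomposed binary format
    INSERT INTO events (doc) VALUES
        ('{"user": "alice", "action": "login",    "tags": ["web", "mobile"]}'),
        ('{"user": "bob",   "action": "purchase", "amount": 42.50}');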

Stuart Langridge:        As you say, there’s quite a big trend toward using NoSQL databases. Does that mean that Postgres is now a drop-in replacement for some of the leading NoSQL databases, say MongoDB or CouchDB?

David Page:                It’s not a drop-in replacement, because we don’t support their wire protocols.  But it’s certainly a near drop-in replacement.  I mean, pretty much everything you can do in Mongo you can do in Postgres, albeit slightly less performantly with the older json data type, and I know from interactions with people in the community, and with people who are customers of ours at work, that people are finding this extremely exciting - being able to bring all their data into one place. 

Stuart Langridge:        I mean, obviously, you’ve always been able to take a big block of JSON and bang it into a database field as one big, long text string, but presumably JSON support in Postgres is more detailed than that.  How does it work?  Can you query against specific fields in a JSON document? Can you aggregate across JSON documents?

David Page:                Absolutely, you can query within documents, and you can construct documents from relational data if you want.  Obviously you can deconstruct documents back into relational format as well, and jsonb comes with some new index operators that allow you to do some really efficient indexing of JSON. One of the cool new features I really like is the ability to do queries on sub-documents.  So you can say, "show me all of the documents that contain this sub-document", give it another JSON document, and it will look for that within all the existing data very, very quickly. 
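
[Editor's note: a rough illustration of the operators described above, reusing the hypothetical events table from the earlier sketch. The @> containment operator and GIN index support for jsonb shipped in 9.4.]

    -- Query a specific field inside each document
    SELECT doc->>'user' AS username
    FROM events
    WHERE doc->>'action' = 'login';

    -- "Show me all of the documents that contain this sub-document"
    SELECT *
    FROM events
    WHERE doc @> '{"tags": ["mobile"]}';

    -- A GIN index makes containment queries like the one above very fast
    CREATE INDEX events_doc_idx ON events USING gin (doc jsonb_path_ops);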

Stuart Langridge:        So is that just an exact match, or do you get essentially the equivalent of a SQL LIKE query, where I can say show me the documents which contain a sub-document containing the following things?

David Page:                The sub-document is an exact match, but of course SQL then gives you the flexibility to do sub-selects from that.  So you can do a LIKE match on the results. 
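
[Editor's note: in other words, the containment match itself is exact, but ordinary SQL can filter the matched rows further. A sketch against the same hypothetical table:]

    -- Exact sub-document match first, then a LIKE filter on the results
    SELECT *
    FROM (
        SELECT doc
        FROM events
        WHERE doc @> '{"action": "purchase"}'
    ) AS purchases
    WHERE purchases.doc->>'user' LIKE 'b%';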


Stuart Langridge:        What else are you doing to help adapt Postgres to be a competitor to, or better than, existing NoSQL databases?  One of the things they tend to claim they excel at is scaling; being able to spin up multiple separate shards without just partitioning your keys across them.  Is that something that Postgres is already good at?  

David Page:                Well, I have to be honest, most of the NoSQL databases are pretty good at that.  The reason they’re pretty good at that is because they ignore many of the ACID properties that we have to follow in a relational database, which actually makes it really, really hard to do those kinds of things in a more traditional, relational database.  But we’re doing a number of things to address that right now.  First off, we’ve got a project underway called the bi-directional replication project that’s being worked on by a number of the community members.  This builds on work that’s going to be in 9.4, which basically does logical decoding of the write-ahead log. Whenever a change is made in the database, the change gets written to the write-ahead log first, which, in all previous versions of Postgres, is a binary-format log.  What logical decoding does is allow you to read that log on the fly, without having to use triggers or anything like that, and return it as a set of logical changes to the data.  That then allows us to build things on top of that technology - some very powerful bi-directional replication tools, for example - so we can take those logical change sets and apply them to other servers, which allows us to do filtering of data along the way and, most importantly, to be very efficient, because it all happens out of process.  Unlike trigger-based replication systems, it’s not holding up the individual transactions whilst a trigger executes. 
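
[Editor's note: the logical decoding infrastructure in 9.4 can be tried with the test_decoding plugin that ships in contrib. A minimal sketch, assuming wal_level is set to 'logical' and max_replication_slots to at least 1 in postgresql.conf, and reusing the hypothetical events table:]

    -- Create a logical replication slot that decodes WAL with test_decoding
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- Make a change...
    INSERT INTO events (doc) VALUES ('{"user": "carol", "action": "login"}');

    -- ...then read it back as a stream of logical changes rather than binary WAL
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- Clean up the slot when finished
    SELECT pg_drop_replication_slot('demo_slot');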

Stuart Langridge:        So you can bring up multiple different shards and then have individual subsets of your data replicate out to those shards - for failover, or just to shard the data across different machines for scaling, or however you might want to do it.

David Page:                Yeah, that’s one potential use case.  At the moment it’s in the fairly early stages.  The infrastructure to actually decode the log files is in our 9.4 release, and everything else is kind of modular.  You can build code that will read those logs and do whatever you need with them, and obviously there are projects getting under way in the community to do all sorts of things with replication, auditing, and so on and so forth.  One of the other areas where we’re working on addressing the needs of users who work with unstructured data is our foreign data wrappers, which I’m a big fan of and very interested in.  These allow you to load a driver - for example, a MongoDB driver - and then set up a query or a database at the MongoDB end, which is represented as a regular table within Postgres.  What this allows you to do is connect to those MongoDB servers that need to be running Mongo for whatever reason and query them as if they were a structured data source, just like any other table.  We have a bunch of foreign data wrappers now for everything from LDAP, Twitter, Mongo and Couch to ODBC, JDBC, and even other relational databases like MySQL and Oracle. This is turning Postgres into a sort of central data store, or data source, because you can connect to everything else from Postgres.
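
[Editor's note: the foreign data wrapper workflow looks much the same whichever driver is loaded. Here is a sketch using postgres_fdw from contrib; the server name, credentials and columns are placeholders, and a MongoDB or LDAP wrapper would follow the same CREATE SERVER / CREATE FOREIGN TABLE pattern with its own options.]

    -- Load the driver
    CREATE EXTENSION postgres_fdw;

    -- Describe the remote server and how to authenticate against it
    CREATE SERVER remote_pg
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remote.example.com', dbname 'inventory');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER remote_pg
        OPTIONS (user 'app', password 'secret');

    -- Expose a remote table as if it were a local one
    CREATE FOREIGN TABLE remote_assets (
        id    integer,
        name  text,
        owner text
    )
        SERVER remote_pg
        OPTIONS (schema_name 'public', table_name 'assets');

    -- Query it like any other table
    SELECT owner, count(*) FROM remote_assets GROUP BY owner;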

Stuart Langridge:        So I can essentially use almost any data source I like as basically a back-end for Postgres and just do everything by talking to my Postgres database?

David Page:                Absolutely. There’s a proof-of-concept FDW written for Twitter, so you can do a SELECT * FROM the Twitter stream.  There are other FDWs that know how to read CSV files. It’s very powerful, and gives you ways of doing imports, doing data loads, running reports across multiple data sources - no end of possibilities, really.
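
[Editor's note: the CSV case is covered by the file_fdw module in contrib. A minimal sketch - the file path and columns are assumptions, and creating the foreign table requires superuser rights:]

    CREATE EXTENSION file_fdw;

    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

    -- Map a CSV file on the database server's filesystem to a foreign table
    CREATE FOREIGN TABLE sales_import (
        sold_on date,
        product text,
        amount  numeric
    )
        SERVER csv_files
        OPTIONS (filename '/data/sales.csv', format 'csv', header 'true');

    -- Report across the file as if it were a regular table
    SELECT product, sum(amount) FROM sales_import GROUP BY product;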

Stuart Langridge:        Absolutely, that’s fascinating. You mentioned that these are being written by the community, and that you’ve got a reasonably sized community already, but are people generally contributing to the core Postgres itself, building extensions for it, or is it just a large user community helping one another out?  What sorts of communities do you have and how do they interact?

David Page:                All of the above and more, really.  We obviously have a group of people who work on the core Postgres server, and that’s some pretty complex technology, so it’s not something that everybody wants to get involved in.  Beyond the development community for the server itself and its add-ons, we have other people who work on drivers, or who just spend a little bit of time writing a foreign data wrapper or something.  We have people working on things like alternative replication tools, management tools, drivers for different languages - really, it’s a pretty vast ecosystem around the core server.  In addition to those people, we also have people who are users; they come to conferences, join us there for a day’s worth of talks, and maybe join us in the evening.  We also have regular user groups in lots of cities. It’s pretty diverse.

Stuart Langridge:        If I’m using Postgres in a semi-serious way in my organization, is it worth my while getting involved in the community?  Obviously, if I’ve got tech support questions, I can show up on your forums and ask them, and I suspect people will help, but what do I get from being involved in the Postgres community?

David Page:                If you’re involved, you get the chance to help shape where Postgres is going.  Obviously, the more you get involved, the more of a say you have.  If you’re going to contribute code, you get the chance to design that code.  Within the community, people don’t just show up with a patch and we commit it - it’s a very collaborative development process - but by getting involved you really do get to shape how the product will work, far more than you can with anything that’s led by a single commercial entity. 

Stuart Langridge:        Let’s talk a little bit about Postgres’s traditional strengths.  Obviously it’s a relational database, and you might think of there being two different ends to that.  You’ve got the high-end stuff - Oracle and SQL Server, as you mentioned - and you’ve got what’s traditionally seen as the lower end; MySQL is the obvious example here.  Are there particular areas where Postgres is stronger than the competition, where you would recommend Postgres?  And are there particular areas you’re not really concentrating on, where someone should look for an alternative?

David Page:                The Postgres community is, I wouldn’t say averse to the idea, but at the moment we’re not really concentrating on an equivalent to, say, Oracle RAC.  That said, in my experience, most people who have RAC don’t actually need it.  In many cases, they could quite easily replace their system with a couple of Postgres servers set up with some failover and appropriate monitoring. Our strength, and the thing we really do pay a lot of attention to, is technical correctness and data correctness - making sure we follow the SQL spec as closely as possible.  As far as I’m aware, Postgres is the most spec-compliant database there is.  Making sure your data’s safe and stored properly.  Data is validated - if you submit a date into a date column, we check that it’s actually a valid date.  That’s always really been the main ethos, if you like, of the project: correctness comes first.  I think it’s pretty important.

Stuart Langridge:        If you need a motto, that’s not a bad one, I think.  There’s a kind of persistent view that something small like MySQL is simpler, and therefore if you’re only putting together a small project it’s a better choice; if you want something larger and more complicated, something where you need to do a lot more with your data, then Postgres is better, at the cost of being more difficult to use.  But is that view justified?  There are an awful lot of web hosting environments out there which just give you a control panel, PHP and MySQL - it’s almost always MySQL.  Is that just a historical artifact, or are you working on making Postgres more appealing to people in that kind of small environment?  

David Page:                Yeah, it’s an interesting question, and something that we’ve thought about long and hard and worked on over many years.  Years ago, certainly, MySQL was much easier to set up and use.  Nowadays, the situation I think we’re in is that Postgres is just as easy to set up and use, but Postgres has a lot more advanced features than MySQL, and those are where you’re going to start hitting complexity. Although those advanced features are more complex, we’ve done a lot of work over the years to try and improve things for new users.  Seven or eight years ago, MySQL had nice GUI installers, and when you came to use Postgres it was, well, how do I compile this from source and install it?  A colleague of mine in the community and I originally worked on some installers, and when we first built them we decided to try and make things as simple as possible - we’d make it really easy for people to get up and running on our new Windows port.  The company I work for, EnterpriseDB, took things one step further when we said, all right, we’re going to redesign the installers completely.  We recognized some of the limitations in the original ones, and nowadays we have a set of installers for Windows, for Linux and for Mac, so that if people choose to use those, they get a very simple experience of just four or five clicks and it’s installed and running.  So I think these days the argument that MySQL is easier is no longer true. 

Stuart Langridge:        You mentioned there about having different installers for different platforms.  That’s an interesting point.  Let’s talk a little about how one actually gets Postgres.  Do you find that most people are deploying on Linux hosts?  Are you expecting them to use the Postgres packages that come with their choice of Linux distribution, or do you encourage people to install from PostgreSQL.org?

David Page:                Another good question. Most people will use the installers for testing things out, for prototyping, or for working on their laptops. When they deploy, what they will tend to do is move either to one of the RPM distributions or to the Debian or Ubuntu packages.  Now, what we found is that, with Red Hat for example, Red Hat are on a five-year release cycle for RHEL.  So they’ll lock onto one major version of Postgres at the beginning of that release cycle, when they’re just getting ready to go to beta, and then they’ll stick with that version - I think it’s Postgres 8.4 in RHEL 6, for example, and I think 9.1 in RHEL 7.  That really doesn’t help a lot of users who want some of the newer features in Postgres.  Postgres is advancing very quickly.  We have major releases every year, and that’s where we add new features; we never add new features to minor releases.  People want to get those new features - people are going to want jsonb support in PostgreSQL 9.4, for example.  So one of the things the community does is maintain both a Yum repository and an APT repository on PostgreSQL.org, so that users who prefer not to use the vendor-supplied copies of Postgres can come to us and get whatever version they want for whatever version of RHEL, Fedora, Ubuntu or Debian they want. 

Stuart Langridge:        So you’re working to make sure that - obviously you’ve got a certain amount of combinatorial explosion here - you’ve got various different versions of RHEL, of Ubuntu Server, of Debian, and various different versions of Postgres.  So what do you do about supporting those different versions, or are all versions of Postgres supported on all releases of Ubuntu Server?

David Page:                No, we tend to phase out the older versions of Ubuntu as they get replaced by newer ones.  We won’t stop creating new builds of an existing Postgres version for an older Ubuntu release; we just might not support that release with future versions of the Postgres server.  In the case of RHEL it tends to change slowly, so it’s not really a major issue.  It’s more of a problem with Fedora, of course, where we’re in the same sort of boat as with Ubuntu, but we do have a pretty wide selection.  I think we normally support somewhere around the last four or five versions of Fedora and Ubuntu with any given release of Postgres.

Stuart Langridge:        That makes sense.  I know you’ve been working with Bitnami's BitRock packaging technology as well to put together the Postgres installers.  How’s that been?

David Page:                Fantastic, they’re a great bunch of guys.  I love working with them, to be honest.  We use their installers for the Postgres installers that we build, and we also use them for all of our products within EnterpriseDB.  The technology is great because I can write an installer once and it runs on a ton of different platforms - not just every version of Linux, but it’ll run on HP-UX, on AIX, and whatever else I need.  There’s not much more I can say.  It’s great technology and they’re a really nice bunch of guys.

Stuart Langridge:        So talking about doing new releases and so on, what’s your release strategy, what’s the cadence?  How often do you put out new releases?  What’s the policy on when a new release comes out, what it should have in it, that sort of thing.

David Page:                We put out a new major release of Postgres roughly every 12 months, give or take a few weeks.  Those new major releases will have new features, and may have database format changes, so you’re required to go through an upgrade process to get from one major release to another.  Each of those major releases will have what we call point releases, or update releases, that are pushed out when there’s a need.  So if we find a particularly serious bug, or we collect a handful of minor bugs, we’ll get to a point where we say, it’s time to release.  Those releases never, ever change data formats or APIs, or add features.  And the golden rule as far as Postgres is concerned is that we always consider it safer to upgrade to a new minor release than not to upgrade, and I think that’s really important.  We sometimes find people who have stayed on the dot-one release for ten years - you get heart palpitations trying to give them support, wondering how many bugs are in that version they’re running that we fixed long ago. Once a major release is out, from that date we consider it fully supported in the community for five years.  There are various companies that provide Postgres support, most of whom we know well and who contribute to the project, and some of those will support a release for longer periods - seven or even ten years - but obviously that’s paid-for support. 

Stuart Langridge:        Yeah. And a major release in this context is 9.4, as an example - so the major version is 9.4, not just 9. 

David Page:                Right - 9.0 was a major release, and if either the first digit, the first and second digits, or just the second digit changes, that’s a new major release.  The third digit is always a bug-fix release, and with the installers we also have a build number on the end, so it’ll be something like 9.3.4-1. 

Stuart Langridge:        So I’ve got Postgres from my choice of place, either from the APT repository or from using one of your installers.  One of the big things that’s happening now is people deploying stuff into the cloud, so instead of just building things in your own data center, you’re deploying to EC2 or Rackspace or HP Cloud or any one of a hundred different cloud providers.  Are you seeing a lot of people doing that with Postgres now, and what kind of work are you doing to make Postgres good in that sort of environment?

David Page:                Absolutely, we’re seeing people doing it.  To be honest, there’s not a huge amount of work going on in the community on that front, because Postgres just works in the cloud.  It doesn’t care whether it’s on virtualized hardware, or real hardware, or a virtualized machine where you know the physical host is in the room next door - it just doesn’t care.  Within EnterpriseDB, we have a product called Postgres Plus Cloud Database, which does take advantage of some of the cloud features; it’s basically a management system that will let you do one-click deployments of clusters of servers with replication between them.  So you have a read-write master and then one or more read-only slaves, with load balancing and what have you over the top of it.  We add cloud-like features to that, such as auto-scaling, so we can automatically increase the amount of storage you have, and we can spin up additional read replicas if you need additional capacity, that sort of thing.

Stuart Langridge:        Yeah, and I believe you’ve been working with the OpenStack project as well.

David Page:                We’re using it a lot internally within EDB for testing the Postgres installers and for testing the proprietary version of Postgres as well.  But also we’re in the process of porting Postgres Plus Cloud Database across so that it will be available as a service that some of our larger customers can deploy in their cloud infrastructures. 

Stuart Langridge:        We talked a little bit earlier about Postgres being easier to set up than people think it is, and so on and so forth.  What about the level of experience you’d expect someone to have to run Postgres in production, whether you bring it into your organization or install it in the cloud for your organization?  What kind of level of experience do you think people need to run Postgres well - to be a good DBA?  Is it something that’s possible to do as a relative novice, or do you think you need strong sysadmin skills to get the best out of it?

David Page:                It really depends what you’re doing with it.  I mean, if you’re trying to run a 10-terabyte database and you’ve got 100 users connecting to it constantly and running data warehouse-type queries, then yeah, you’re going to need someone who is familiar with tuning Postgres and familiar with tuning the operating system and the hardware to get it to run well.  However, most databases aren’t that big.  We hear about web scale so much these days, and it always annoys me, because so many of the systems I see and come across on a daily basis are nothing like that big. I have customers running 10-megabyte databases.  They’re growing, but they’re very, very tiny.  In reality, I think the majority of the databases I see are probably in the range of a few gigabytes, and on modern hardware you can install Postgres and run a typical app - say a helpdesk or an asset management system that’s used in a company by a couple of hundred people - with virtually no skills at all. 

Stuart Langridge:        That’s ideal for someone like me, if nothing else, but it’s useful to hear.  So you’ve already spoken a little bit about jsonb and moves towards the NoSQL space and so on.  Where else is Postgres going next?  What are your plans for the next six months, the next year, the long term? 

David Page:                It’s difficult to say.  One of the interesting things about working in an open source community is that there doesn’t tend to be a long-term road map, because all of your developers are volunteers, so as a general rule they tend to scratch whatever itch they’ve got at the time. We’ll take code for new features as long as they make sense, are reasonable features, are properly written, etcetera - we’ll generally add them to Postgres.  I think the big trend at the moment really is more towards clustering - the bi-directional replication project, for example.  People have been looking at and working on using the foreign data wrappers for sharding.  I think in general we’re going to be heading more in that sort of direction.  That said, there are always more SQL features to add, and there are always better ways to write the optimizer.  There are projects like parallel query that colleagues of mine at EDB are working on - actually, two of my colleagues are adding support for parallelism within the server, so that it can have multiple processes running concurrently to handle the same query, that sort of thing.  There’s plenty of work for us to do in lots of different areas.  As other technologies around us change, there are going to be more things that people want, new ideas that people come up with, so I don’t think we’re ever going to say that Postgres is done, it’s finished. 

Stuart Langridge:        At no point do you get to "down tools", I'm afraid.

David Page:                Right, until I retire.

Stuart Langridge:        And one final, totally critical question, we’ve been discussing the project and calling it “Postgres” but it’s actually called PostgreSQL or is it called post-gres-sequel or –

David Page:                No, it’s not post-gres-sequel, and it’s not "postgree" as we often hear; the official project name is PostgreSQL - "Postgres" as one word, obviously, with QL on the end.  However, the word Postgres has been officially accepted by the core team as a short name to use instead.  So it’s either Postgres or PostgreSQL. 

Stuart Langridge:        Hooray, I shall continue calling it Postgres.  Excellent.  Thank you very much for talking to us, Dave, and where would people go to find out more about Postgres? 

David Page:                www.postgresql.org

Stuart Langridge:        Excellent, thank you very much indeed.  Dave Page, of the Postgres core team. 


[End Audio]