Being a software engineer I'm often segregated off to a group of people that work in dim light and subsist on Doritos, caffeine, fast food, video games (an often over looked food item to this community, but a developer can experience a kind-of zombie-needs-brains hunger if they are without a game controller in hand) and beer. And while I can (and have from time to time) indeed lived on such things, this engineer loves to cook.
As with many cooks, Julia Child is the prophet at whose kitchen table I worship. With her 100th birthday celebration last month, I've been thinking through a lot of her teachings and how they affect my life both in and out of the kitchen. So please excuse my indulgence here as I explore databases and data design via the wisdom of Julia Child.
- "Theoretically a good cook should be able to perform under any circumstances, but cooking is much easier, pleasanter, and more efficient if you have the right tools."
The same should be said of a good database administrator, DBA, and designing a database. Yes, he could design a data warehousing solution with simple text files (he may also be secretly planning inhuman forms of torture for you, but he could do it). Over the last 5 to 8 years, there has been a surge in "alternative" database solutions in the area document-oriented, distributed, graph, unstructured-data databases; with such an influx there are many to choose from besides the standard RDBMS we all grew up with. Each of these solutions meets a specific need; in fact some are highly specific in what solution they can solve.
As with most cooks who absolutely must have all the latest gadgets and soon find themselves with the world's most expensive potato peeler, we technologist find ourselves reaching for the latest, fancy, cutting edge database solution without necessarily looking at what the actual needs are. And while, yes, it may work; you have to ask yourself, is it the best solution? Typically with database solutions you need to look at the areas of: reading from disk, writing to disk, indexes needed and size, storage needs, types of queries and how they will access the data, and the timing of transactions. More questions can be accessed about locality, distributed, disaster recovery and the like, but most solutions have some form of this already solved.
It is also a misconception that you need just one database solution. As data warehousing needs grew, so did the acceptance of having different data solutions under one roof and even storing the same data in different schemas. Now, with data being pushed out to cache servers and search engines, there is no reason to off load routine operations like querying to these solutions. It is just a question of how quickly your data is updated and how quickly the results are needed. In fact, some groups have taken to replacing their data storage solution entirely with a Solr/Lucene implementation.
- "Always start out with a larger pot than what you think you need."
As with cooking, it is often easy to under-estimate your needs even if it mathematically works out. Two quarts of liquid will fit in a two-quart pot, but you are going to make a hell of a mess when you need to stir it or it starts bubbling. The calculating of system level needs is fairly straight forward, and there are many online guides to help you get started; but mostly it comes down to hard disk and memory. Remember that RAID 10 is your friend; especially on EC2 instances with EBS storage, it helps mitigate the IOP performance. Though more expensive to do RAID 10, it is like investing in a good pot or pan, yes it will cost more but it will last you for a very long time.
While a good rule of thumb is to have a lot of memory, mainly it is that you need enough memory to not only support the OS and other applications running on it but to hold your most used indexes. If they don't then when a query needs to perform a full index scan, it will need to pull the parts not there into memory and remove parts already there; then the next scan has to do the same (basically pigeon hole theory with evil city pigeons). Consequently your index is doing more work and causing a hit to your hard drives. Some database solutions have tools to help determine the current usage and provide recommendations for allocations. Other database solutions use things like memory mapped files, which allow your data to be treated as if it was a big swap space allowing the kernel to help with the swapping and aid in performance, however even with these you want your indexes to be able to fit in memory.
At the very least, recognize that your current design will fail to scale at a certain point. There is not a design or solution that will take you from one user to 5 billion. Even the largest technology companies have had to adjust and readjust. Often it is a mixture of server costs (maintaining large under utilized server farms is unrealistic) and technological capabilities (most of the recent technology available has been from large social networking and the like needing to support their load). Most database solutions have some level of distributed nature to them; as you adjust in what you are providing and growing your application, you will need to constantly explore and test solutions and adjust your implementation.
- "Cooking is one failure after another, and that's how you finally learn."
As a DBA and as a cook, my greatest learning experiences have been my biggest blunders.
Too often I've seen this overwhelming fear to rework parts of data model implementation or even change up database solutions. Even the suggestion of upgrading a database server can cause some DBA's to twitch. Yes, your database is the foundation of your application and changes to it mean changes which could impact all the layers above it. Simply adding columns or indexes to a database is not growth but maintenance stagnation.
The best way to make sure you can grow your data solution is the data API that exists on top of your database. While this API will most likely need to change, having it centralized, defined, and controlled mitigates a lot of hunting, massive manual code refactorization, and death threats. If the data model is unable to be changed then your implementation is broken. The ever-growing so-called schemaless database solutions (so-called as they do usually follow some sort of schema) are designed around the fluidity of your data needs. These types of designs help significantly in storing sparse data (especially with likes of use data which has so much optional data for them to fill in) and textual data.
- "Always remember: If you're alone in the kitchen and you drop the lamb, you can always just pick it up. Who's going to know?"
While hopefully the cook following this advice doesn't have flesh-eating staph on their floor, the point for the DBA is that as long as a user's experience is unaffected it doesn't matter what is going on with the database solution. As long as the user is able to get the data they need quick enough and it is secure, data design and infrastructure design is immaterial. And most importantly they need to be hidden from the disasters on the backend. Often you only learn about the issues when they happen and are affecting users, but there are often growing pains. However, after you stabilize, you should be flexible to handle the potential issues that spring up, or land on the floor. Often this is done with replication or failing over to some sort of cache data setup. As the database is the corner stone of your application, it's failure can impact all tiers and can be a mess like pureeing a hot soup in a mixer and not holding the lid down (for the unlucky, butternut squash raining down like napalm is a memorable disaster).
Julia Child changed what it meant to cook and eat. Such a feat has lessons that extend to all walks of life. It takes a lot of determination and commitment to design and make a product and make it amazing. It took Julia Child years and a lot of rejection before Master the Art of French Cooking was published. Within a startup the same tireless work and rejection will also be required, but a good product will eventually win out.