One company I was at did a merger with another startup, and most of the other company's engineers quit. Amongst the piles of Visual Basic we found a stored procedure that was about 5 pages long. It took about 18 hours to run; its job was to do a daily report.
I hate being afraid of code. I spent a day with it, got to understand it, then rewrote it as a couple of queries and some Java code, whereupon it took about five minutes to run.
[... then there was the guy who implemented bitwise AND and OR by precomputing some 65536 entry tables. Wow. Why do I find all the really howling bad stuff so close to databases?]
"Why do I find all the really howling bad stuff so close to the databases?"
Because the database and queries against were were coded as an afterthought.
Because the programmers ran all the db access tests against tables with 4 rows of data when the customers would windup with 100 million rows in production.
Because most undergrad education around databases is poor and antiquated. Plus, it is not SQL focused - much time is spent on talking about data modeling. Not wrong, but not often helpful in practice.
Because DBA's are (rightly) focused on making sure your db is properly backed up, replicated, and fault-tolerant. They may not even understand SQL queries beyond the basic one table select.
Because the users will force data into the systems that the developers never anticipated.
Because db jobs are segregated by product for development. If you've done SQL coding on Oracle for the last 10 years, good luck applying for a job doing SQL coding on MS SQL Server.
Because SQL hacking doesn't get much respect. I've been in job interviews where deep into the process they ask me a question with data in two tables that requires an outer join. And answering correctly differentiates you significantly from other applicants. I guess it is the "fizz buzz" of SQL these days . . .
I have a slightly stupid question. How is it that we've come to a point where we think of application and database development as separate things? I think this sets the stage for a lot of the problems you're getting at.
On the one hand we make-believe that all you need to know is how to sling code and as long as you can somehow get your data into the database and out, then you've done your job as a developer. I think that's bullshit. No one would use your application without the data, so the database is as much of your responsibility as any other part of the system. You should know how to use it well and should take responsibility for its quality.
On the other hand, there are things that belong in application code, rather than, for example, in a stored procedure. Most things you could do with a stored procedure probably belong in application code. The app is where the business logic is, the database is where the data is. As a DBA you aren't doing your job right if you feel OK about implementing business rules in SQL. It's much poorer at expressing intent than an OO language would be, it's less readable, and doubles the number of places you need to look for domain logic.
How is it that we've come to a point where we think of application and database development as separate things?
My take is that the principles of database engineering only make intuitive sense at scale.
If every programmer started their career as an assistant DBA at a web host with a lot of traffic, or a data warehouse with a ton of transactions and report generation going on, or a company where lots of different groups were developing client code against the central database, every programmer would be inspired to understand databases.
As it happens, programmers start their career building little apps with one client at a time and a handful of data. And many of them finish their career building such apps, because they are very handy. And these programmers may never feel a burning need to know how databases perform with large datasets, or a huge stream of colliding transactions, or a mission-critical data integrity and safety requirement, or a variety of clients all trying to generate complex reports.
I agree with your point, but with an important caveat.
You are right that application developers should not be ignorant of databases, and DBAs should not be ignorant of what application developers are doing. All too often, this is the case.
However, the point when people came to think of application and database development as two separate things was the origin of databases themselves, and arguably one of the most important historical points in computer science itself. The notion that data has a life outside of any given application, and that it deserves itself to be managed, was and is incredibly powerful.
The same reason we have front-end developers who don't know how to write business logic, and back-end developers who don't know how to write a presentation layer.
The app is where the business logic is, the database is where the data is. As a DBA you aren't doing your job right if you feel OK about implementing business rules in SQL
Sigh... This again. Out here in the real world, we have databases with a hundred apps connected, with tens of thousands of concurrent users, and these apps are written in a dozen or more languages, there's COBOL, VBA, C++, Java, Python... There's green screens, desktop apps, websites, embedded devices...
The ONLY way to have consistent, enforced business logic, from ANY application in ANY language (even people executing SQL directly) is to do it in stored procedures. Doubles the number of places? Your way is 100x the number of places. Little websites with one application running on one database are not really representative of serious database use.
The ONLY way? I call bullshit on that too. Databases like the one you mentioned exist, but I still think if you have business rules you want to enforce across all applications connected to a database, you wrap that in a service and let the applications call the service. Clearly stored procedures aren't the ONLY way.
You're seriously proposing going back over 25+ years of applications and rewriting them all to talk to a "service" instead of the database? Over what protocol, CORBA, DCOM, SOAP, WSDL, some new thing? And what about third party apps that connect to the DB, have the vendors rewrite them too, to talk to your whizzy new "service"? And what language is your service written in, the fashionable one du jour, that in 10 years your successors will sneer at as "legacy" anyway? And it's as reliable and high-performance as your most mission-critical application, since now it's a single point of failure and a bottleneck?
Ho ho ho. Website experience doesn't translate to real database work.
I'm not proposing anything. My point is that there are other options.
As to the rest of your response, I don't mean to be rude but you're simply rambling. Obviously with a legacy system you work with what you've got, but the point is what you would choose if you had a brand new system and needed to make the call today. Choice of language depends on system requirements as we all know.
I really feel uncomfortable about all this posturing. I'm all for discussion, but if you honestly consider me a moron for saying what I have, you shouldn't have any interest in responding. Participating in a discussion merely to dismiss (and disparage) the other party is pointless and boring. Down-vote and move on.
It depends on many factors, "system requirements" is so vague as to be meaningless. Conciseness of expressing the algorithm is one, sure, so is the need to interoperate with other systems, both bespoke and off the shelf, so is the existing skillset of the organization and the talent pool it hires from.
It must be nice to only work on greenfield projects, but as I keep saying, it is not experience that transfers to large scale development efforts, which span decades and continents. If you spread this kind of misinformation, you create more problems than you solve, when the "business logic" is in a hundred places and no-one knows what bit of the system does what. Sure SQL isn't the perfect language, but it is the de facto lingua franca. And there is no "the app" for the business logic to live in. There are a hundred apps, each of which is a piece of "the app".
And if we're pointing fingers, it was you who asserted that a DBA who sees the big picture (funny how every developer thinks theirs is the only app in prod, when they are really in a cast of thousands) doesn't know how to do their job. Show a little humility and you will learn.
I work with large databases every day. I cannot tell you the amount of people I've interviewed that weren't able to solve a left join question. Let's leave right and full outer joins out of the discussion--a left join is as advanced as I dare get off the bat in an interview.
I've once had a guy I worked with ask for help when he was with a new company. I told him to look at the query plan. He replied, "Yeah, but that doesn't tell you much."
Relational databases are oft-misunderstood and underestimated time and time again. Where are the Codds of yesteryear?
I've never understood the typical programmers aversion to query writing. Writing SQL is in general pretty easy (basic set theory).
I'm guessing it's all about what people have experienced. I have been coding against RDBMSes since my first job while still in college. I have even read Celko for fun at one point :)
IMHO, learning the hard parts of SQL only come from experience with particular RDMSes and dealing with large datasets. For example, knowing that the only way to get a query to run against a large Teradata table without running out of spool space is to create and use temp tables only comes as a result from trying to run the query.
I think it's easy once you've got your head round the model of data sets but there are plenty of experienced programmers who, due to lack of experience with SQL, still drop back to iterating through large collections of data.
It's the fastest way for them to solve a problem in terms of development effort, but it often leads to poor solutions and poor SQL.
Most developers learned in school to program, i.e. writing algorithms. Most development in the "business" world is not that, it is basically managing and transforming data. Something that RDBMSes and SQL is very good at.
The result is that many developers will do selection and sorting algorithmically (taking the bugs and LOC that entails), instead of using declarative SQL.
I have worked at places were the database is just seen as something in the periphery, that you just have to deal with since Oracle is a requirement from customers. Its sick really when you think about it.
Im disappointed by many of the comments here that basically is "SQL sucks because I don't understand it".
In four weeks I've gone from not remembering the difference between inner and outer joins to isolating a bug in the Postgres optimizer... and yeah, it's all about daily experience.
I have no formal CS schooling, so thinking in sets is far from basic - I often have to stop myself from thinking in iteration. Reading Celko's "Thinking in SQL" book is in fact helpful (although why does he keep yelling at me? and what's FORTRAN?)
But it's very foreign and black-box-y to many programmers, and a tiny bit of SQL voodoo, like a tiny bit of Javascript voodoo, will get you a very long way with very poor performance.
I learned SQL with copy/paste/modify from PHPBB code base. When I got my first web programming job I could write basic select/insert/update/delete but knew almost nothing of joins. I learned the rest by reading MySQL docs on the job. (Later I also improved my understanding of joins by doing an MS Access project.)
So SQL is just as easy to pick up as anything else, but you're going to be horrible at it at first and need to get past the mental block of seeing something so different conceptually to "normal" programming.
One of the biggest hurdles that typical programmers must overcome is that SQL is a declarative language, and therefore quite different to the imperative languages they are used to.
Imperative programmers all secretly believe in a declarative style, that's why they compile with -O and hope the compiler's "query optimizer" will make it fast for them.
I work in a very heavy database environment, and it's interesting. Nobody here ever talks about "left" vs. "right" joins. We also don't use ansi syntax (oracle), so for a typical outer join, we'll do either:
select *
from foo f, bar b
where f.baz = b.baz (+)
or
select *
from foo f, bar b
where f.baz (+) = b.baz
But we don't call one a "left" and the other a "right" join. Just bringing this up to note that the right vs. left outer join distinction might be unclear to even relatively experienced engineers coming from such an environment.
This kind of notation (a comma-separated list of tables) used to be the standard. We've moved beyond it for a very good reason: there's a very common class of queries that this notation cannot express.
It's very common to want to LEFT JOIN two tables, and then see specifically which ones on the left had no corresponding record on the right. But the old way cannot specify the JOIN criteria separately from the filter criteria, so we can't express the query.
For example, in my current database, I might do the following to see what categories are not represented by any products:
SELECT *
FROM Products.Category AS c
LEFT JOIN Products.Product AS p ON c.CategoryId = p.CategoryId
WHERE p.ProductId IS NULL
By specifying the JOIN criteria in the ON, I can then use the WHERE to winnow down to the left-only rows. But if that were expressed in the old notation, it would be mistaken for JOIN criteria, and thus all categories would appear to have no products.
I find the JOIN syntax to be far more maintainable too, the WHERE clauses become really hard to read amongst the dross of boilerplate JOIN clauses from the salient WHERE clauses. Things also get confusing as soon as you throw an OR in the WHERE clause, too many unnecessary brackets.
Also, changing schema tends to be easier too as you can often just delete the JOIN line if you've been coding consistently instead of having to pick through the WHERE clause on more complex queries.
In the end though I guess a lot of it comes down to what you're used to, though I have been exposed to both and would definitely be in the JOIN camp and help utterly destroy you evil ,s if we had to have a worldwide battle to decide the fate of SQL kind.
Let's modify the query slightly, to retrieve all categories are not represented by any products from a particular manufacturer. The standard syntax would look like this:
SELECT *
FROM Products.Category AS c
LEFT JOIN Products.Product AS p ON c.CategoryId = p.CategoryId AND p.ManufacturerID = 123
WHERE p.ProductId IS NULL
How will you express this in your notation? In particular, what differentiates the "p.ManufacturerID = 123" of the join criteria from the "p.ProductId IS NULL" of the filter criteria?
I never ask for an explicit left or outer join in an interview. It's always a question, like, "Get me the total sales of every salesperson for this quarter given this schema." And then they return it, and I say, "Can I get the list of all salespeople, even if they haven't had any sales?" And that's where it falls apart, sadly.
unfortunately, I must agree with you. 9 times out of 10 candidates cannot answer the question you've posed. One time I even had a candidate with 17 years of experience (!!!) who didn't even know what a join was (masked in a question form). I let it slide, pretended like it was an "advanced" question, and came back to it later (figured he was nervous). The second time around I phrased the question "so, imagine I want to join data from these two tables so that the output would be first name, and location" (2 different tables). Even then, the candidate did not get the hint.... Instead, he offered to change the schema of the fake database and have all the information in one table....
I also find the Oracle outer join operator (+) easier to use.
I just searched the full (Oracle) sql codebase of the project I'm working on, and it seems that there is no clear winner in the "outer join" vs "(+)" battle (outer = 46%, (+) = 54%). There is no code style requirement for joins.
I have trouble with the oracle syntax once you get to two or three tables left joined against each other, and consider it more a relic of old oracle programmers than any actual advantage one way or the other.
I think I read recently that using the Oracle (+) operator can give you the error "cannot outer join to more than one table" if you try to outer join to two or more, but the ANSI syntax will allow you to do that - I really should test this out ...
edit - I tried this out and right now cannot get that error, so not sure where I got that idea from.
I don't think I've ever used a right join in a production query (perhaps a couple of times when I was doing some digging and it was easier), but I don't understand your statement. Left joins are insanely useful. For example, getting all the sales for a salesperson:
select
sp.Name,
sum(s.Amount) as TotalSales
from
SalesPerson sp
left join Sales s on
sp.SalesPersonKey = s.SalesPersonKey
group by
sp.Name
order by
TotalSales desc
If you did a correlated subquery, it would take a ton of time to complete (`select sum(Amount) from Sales s where s.SalesPersonKey = sp.SalesPersonKey`), especially on large tables.
Left joins (and full outer joins) are plenty useful and I use them almost daily. Care to explain what you mean?
What he (probably) means is that the following query returns the same result as your query:
select
sp.Name,
sum(s.Amount) as TotalSales
from
Sales s
RIGHT join SalesPerson sp on
sp.SalesPersonKey = s.SalesPersonKey
group by
sp.Name
order by
TotalSales desc
That's what was meant, I assume, but it's misguided. Right joins don't create a "code smell" for technical reasons, but for practical reasons. Because left joins are the standard approach, if you're using a non-standard approach to a common problem you'd better have a damned good reason (because you're just confusing maintenance programmers... and thus causing errors... with no justification, otherwise).
Me too. Most programmers don't understand the "brain" of a select statement. People gtalk me all the time when they are stuck and the answer is often times to remove 90% of the query and replace it with a GROUP BY and a HAVING. Most people don't think of the steps the interpreter goes through because it is a declarative language.
When I interviewed for my current position (which involves database design/query writing/storage/etc), the interview was heavily SQL-focused. My would-be manager started with very easy, general questions, and then continued to plumb the depths of my knowledge 'til he hit bottom (took about 30 minutes[1]). He is of the opinion that if you can think on your feet about general database design and set theory, you're a pretty competent programmer for the work the company does.
[1] Though I apparently did well enough to get the job, I felt as if I utterly failed the interview.
Can you explain each of them, when they are typically used and why they matter? I've used them before but honestly, I am not that comfortable with them and don't ever think about using them.
I cannot help but think that the recent flavor-of-the-month rush to all that NoSQL is nothing but the average web monkies who do not understand databases, SQL and the fundamental problems RDBMS are dealing with so, yea, when I have no idea about all that I would prefer a network-accessible key-value map as well...
"Because most undergrad education around databases is poor and antiquated. Plus, it is not SQL focused - much time is spent on talking about data modeling. Not wrong, but not often helpful in practice."
I am taking a database management class this semester. Its awful. We spent over half the semester drawing diagrams. Just last week we started working with actual SQL.
The & and | operators weren't good enough for him?
That stuff ends up in databases because on your average team the knowledge of databases drops dramatically when you get past selects and maybe left joins. I worked with a few developers who wrote and lived with queries that were taking 30 seconds each to run on their local machines because they couldn't imagine how to fix them and just waited for me to get assigned to the project in a week or two so I would fix it. They just kept raising the timeout values until the pages rendered. The problem was simple in that they were accessing a table with 30 million rows without using any index at all so changing it to use the clustered index made the query return almost instantly.
I've seen similar things where people come up with crazy ways of doing your most basic database tasks.
So show your coworkers how to run the query-plan dumper for your DB brand. The curtain is torn away, and the underlying ISAM is revealed for all to see to allow the needed "hints", indices or restructuring to be understood.
As my prior boss used to say... "Give a man a fish, feed him for a day. Teach a man to fish, and you'll have to answer fucking fish questions the rest of your life."
Second this - I've cooked up some weird SQL to do neat things.
Other devs on my team had real problems with it - they were stuck and found it really difficult to "think in sets." They kept asking me about how to use cursors to go through data . . . heck, they may even have been right from a clarity and maintenance perspective.
But the object-oriented part of my brain has apparently been excised by spending so much time fishing (SQL). So now I have too many fish questions in that direction.
You're assuming the other people want to learn. Generally, if people want to learn something, they ask questions. Telling co-workers how to do something when they don't ask can be... tricky. It depends on the environment you're in, your relationship with the people, and the people themselves.
I've had more than just a taste of this myself recently. I'm no database expert but I know how to build a query, normalisation, relational integrity and all that. Enough to identify bad SQL in a CRUD application. And I'll take the time to learn more about it and how to better construct a schema and query (and whether or not an RDBMS is even appropriate for the application).
With this in mind my boss mentioned one of our sites was "dying on its arse." It was hammering the slow query log, pages could take 10, maybe 20 seconds to load.
Turned out there were NO indexes at all. Maybe the odd primary key from the default CMS install (which was by no means database efficient). So we sorted that and that sorted out the immediate speed issues.
Of course, it went without saying that a poorly constructed schema must be related to a poorly constructed query or two. We were actually wrong. It was more like a query or twenty.
The submission of one form was done field by field. There was an UPDATE query constructed for each field and processed there and then, on a form with at least 12 fields. There was a comment above, "We have to do it this way. Trust me." As far as I could tell they didn't know how to dynamically build a query.
Further along, different update queries were performed by first deleting the record, and then re-inserting it with one of the fields updated.
I never quite got as far as that. There was no comment anywhere about the reasoning for such a strange approach.
I think some developers don't bother figuring it out because they think there are better things to do than optimise your code.
Hi there! I made mostly random changes until it worked, so I have no idea why this contorted approach fixes the earlier failures. But I spend two weeks on this section, so don't touch it!
[this comment removed and replaced with the "Trust me" line, because maybe multi-line comments are breaking things today...]
We have a legacy database we support that creates a new dynamic stored procedure (with the same body) every time it performs a particular operation. How you can have enough knowledge to do this, but not to use a single procedure is beyond me.
I tried caching the hell out of Rails, kept adding indexes to my SQL tables, created 4 MySQL replicas, added more RAM to my linodes, tuned many default parameters in my.cnf, but the website was still slow to a crawl.
This parameter saved it all "innodb_buffer_pool_size = 768M" ... how am I supposed to know that Rails creates InnoDB by default, i thought MySQL default was always MyISAM.
This is changing because MyISAM is not really a true database (in the ACID compliant sense of the term). I guess I would have assumed they were MyISAM too but they don't perform differently enough at low usage levels that you would notice it right? How big was this was application?
Dumping the schema from the database itself will reveal the table type. Knowing the physical organization of the schema is required for optimization work. If you only used the logical organization information (UML diagram for example) you were doing it wrong.
In real RDBMS systems, tables can be clustered (index organized in Oracle-speak) or not, nested or not, in different tablespaces, with many different parameters that do not change the semantic of DML/queries but change the performance significantly.
Yes i know, but it just never cross my mind that my tables have been InnoDB all these while ... among all the optimization, this is one i overlook, well i'm not a DBA afterall.
Isn't rewriting a stored procedure into multiple queries plus Java code sort of the opposite of what most DB best-practices advocate? Now your query logic is in a mixture of SQL and Java, rather than all in SQL, and you've hidden some of the logic from the database's query planner by moving it into application code.
(On the other hand, I've rarely found query planners, even for Oracle, to be as magical at optimizing as one might hope, so maybe that's why.)
Sure, but I took an unmaintainable stored procedure (with very few comments, by the way) and turned it into something that was small, well-documented, and that ran over two orders of magnitude faster.
If this is wrong and goes against "DB best practices" then I guess this answers my question about where the howlers come from.
[We had a project under development. At some point, and an utter surprise to the team, an Oracle consultant showed up one day and said, "I'm here to tune your database."
"Huh?"
And he did. Against a half-built system with a schema we were still designing, he "optimized" our queries and determined buffers sizes and whatnot, and then he went away.
Apparently the people in Sales were panicked that we had a database under development that we hadn't yet made plans to tune, so they had hired a guy to come in and fix things.
We tossed all of his scripts and rolled back everything he'd broken, and continued coding. Total waste of money.
Databases are a weird confluence of the power-mad, the knuckle-draggers, the money-grubbers, and a few techies who know what the heck they're doing.]
The performance problem with using multiple queries comes mostly from the network latency and query overhead that happens every time your application needs to make a call to the database. Prepared statements minimize the later and connection pools help with the former. The primary thing to worry about when writing queries is whether the logic behind them is efficient and doesn't for the database to do stupid things or fetch the wrong information. (Like using count( * ) where exists is what you mean.) Paying attention to what your queries will make the database actually do is more important than following any guidelines.
I'd agree. I'd also add that well written stored procs are often more readable than well written ad hoc queries.
The longest query I have ever written I thought was approaching unmaintainability at about 100 lines, but I can sit down and digest it without too much effort. I have however spent a week debugging a query that was three times that long. I think the time to understand starts increasing with the square of query length...... 100 lines may not take too long to understand, but 200 lines takes 4 times as long, 1000 lines takes 100 times as long, etc.....
At least my experience is that even well written SQL starts becoming a bear really fast when over 100 lines ling.
Some databases really do loose their mind when the query contains many tables. Sybase was particularly a pain and pretty much required breaking up big queries. Sybase's optimizer for version 12.5 and earlier (cannot speak after that) often required forcing indexes and plans in stored procedures.
"Wow. Why do I find all the really howling bad stuff so close to databases?"
I have seen a few of those, but you also have to include not only stored procedures but also table layout etc. Things like "if I don't define my foreign keys, I can use a foreign key to reference rows in EITHER table a or table b!"
And then of course when you write sprocs against that mess, you get more messes.
The thing is that databases are fundamentally math engines with some extra services tacked on. This means that designing databases is ideally a mathematically sound representation of your data not specific to your application, and the database queries bridge the gap. Most developers don't understand this though and so as the application changes, you get big messes.
Now, obviously the above is an ideal which really can't be met in most cases. However the closer one is, generally the more ideal the situation, and most apps are really far from it.
"Wow. Why do I find all the really howling bad stuff so close to databases?"
BTW, as a counterexample, I saw one developer figure out how any invoices were closed by retrieving millions of them from the db and checking in Perl. All for the want of a HAVING clause.... Needless to say while this worked ok for small amounts of data, it failed with large sets.
I've made a few SQL errors in my past that failed on large sets, but fortunately I was working against large sets, and they failed miserably, and we had to figure out why. I got good advice from an expert and fixed the problem. In many ways keeping queries separate, and focusing on readability/code standards helps a great deal I think.
The worst problem I've seen in a similar vein was a build process that involved munging a large (200+ mb) xml file. Someone had written a little JS tool for the purpose, and later the use of it had expanded so that it was being run over 150 times during the build. Since it took about 4 minutes (each time) just to read and save the xml file this meant that the build was now spending around 10 hours every day just munging the xml file. I rewrote the JS bit so that it took in a config file and could do all the processing necessary in a single pass, which only took 4 minutes.
I hate being afraid of code. I spent a day with it, got to understand it, then rewrote it as a couple of queries and some Java code, whereupon it took about five minutes to run.
[... then there was the guy who implemented bitwise AND and OR by precomputing some 65536 entry tables. Wow. Why do I find all the really howling bad stuff so close to databases?]