Let’s supposing that you have log files of some sort pouring in and you want to put aggregate data representing the logs into an rdmbs. To begin, let’s start with a blank slate, i.e., just dumping the data in. And let’s have a simple table, that in mysql is created via
CREATE TABLE `history` (
`id` int(11) NOT NULL auto_increment,
`hits` int(11) NOT NULL,
PRIMARY KEY (`id`)
);
I did a pass each for both MyISAM and Innodb with a million inserts.
engine
|
queries per insert
|
seconds (lower is better)
|
| MyISAM |
10000
|
7.046952963
|
| MyISAM |
1000
|
7.342753172
|
| MyISAM |
100
|
8.521313906
|
| MyISAM |
10
|
31.44731498
|
| MyISAM |
1
|
135.3045712
|
| MyISAM |
load data infile |
4.927606106
|
| Innodb |
10000
|
19.76374817
|
| Innodb |
1000
|
30.58060002
|
| Innodb |
100
|
89.54839206
|
| Innodb |
10
|
723.135994
|
| Innodb |
load data infile |
17.25715899
|
A multi-value insert for three values looks like this
INSERT INTO today (hits) VALUES (?), (?), (?)
Then I execute with the three values.
The fact that inserts with 1000 values start to approach the load data infile numbers is a little compelling. But let’s suppose that we want to do every insert from a bulk load but we want to have a table (like history above) that has aggregate data, += style. Is it possible? Sure.
Here is one approach for mysql:
- Create a temp table, which I will call today
- Bulk load the data into today
- Run the query INSERT INTO history (SELECT * FROM today) ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;
- Drop today
I would like to apply this strategy and contribute some pig code that allows for bulk insert. This would (I think) allow for some pretty large scale aggregating all from with a “simple” pig script. Would also like to start using chukwa, but it looks a little tough. I think the architecture would then look something like
web servers -> chukwa -> pig -> mysql
Think then I would be pretty well at yahoo! or facebook scale.
Guess we’ll see how it all goes
Enjoy!
Earl