Lesson learnt from data cleansing. Part II
OK, now that Part I is done, I have learnt another thing from data cleansing. I believe the best tool for data analysis is R, but the learning curve is quite steep; I'm trying to learn it right now. Hopefully, once I get some results out of this with Neo4j, I can move on to R for some more statistical results.
Now, the second challenge: the CSV file is about 1GB with around 100 columns, and I was trying to use Ruby to parse it. Obviously, my Terminal just printed Killed before I even got to the first line. The way I was parsing it, Ruby had to load the whole file into memory first, and since I'm running everything on my small Mac, that turned out to be quite a challenge. So what I did was dump everything into MySQL instead, and this time it worked really well. MySQL has LOAD DATA INFILE, which can load a CSV file straight into a table. It only took me around …
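For anyone curious, the bulk load statement looks roughly like this (the table and file names here are just examples, not my actual schema):

```sql
-- Bulk-load a CSV straight into a table; much faster than
-- inserting row by row from a script.
LOAD DATA INFILE '/tmp/data.csv'
INTO TABLE events
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row
```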
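As an aside, Ruby can cope with a file this size if you stream it row by row instead of reading it all into memory at once. A minimal sketch, with a hypothetical file and column name:

```ruby
require 'csv'

# CSV.foreach streams the file one row at a time, so memory use stays
# flat even for a 1GB file. CSV.read, by contrast, slurps the whole
# file into memory, which is what gets the process Killed.
CSV.foreach('data.csv', headers: true) do |row|
  # process one row at a time, e.g. pick out the column you care about
  puts row['user_id']
end
```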
First attempt
I indexed the joined columns and shrank the table schema: VARCHAR(50) instead of VARCHAR(255), and of course TINYINT instead of VARCHAR(5) for small numeric values.
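For illustration, the changes were along these lines (the table and column names are hypothetical):

```sql
-- Shrink the column types so the row data and indexes stay small
ALTER TABLE events
  MODIFY name   VARCHAR(50) NOT NULL,  -- was VARCHAR(255)
  MODIFY status TINYINT     NOT NULL;  -- was VARCHAR(5)

-- Index the column used in the join
CREATE INDEX idx_events_user_id ON events (user_id);
```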
Still, I couldn’t get the results back within an hour, and I know MySQL is better than that.
Second attempt
So, I moved on to the next idea. It looked like everybody on the Internet suggested setting key_buffer_size (or key_buffer in older versions) to 4G. I also changed some other variables to pimp MySQL up a little bit (the snippet after this list shows how it all sits in my.cnf):
* query_cache_type = 1
* query_cache_size = 128M
* query_cache_limit = 128M
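For reference, those settings go under the [mysqld] section of my.cnf; note that key_buffer_size only helps MyISAM indexes:

```ini
[mysqld]
key_buffer_size   = 4G    # index buffer for MyISAM tables
query_cache_type  = 1
query_cache_size  = 128M
query_cache_limit = 128M
```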
Still, I couldn’t get the results back.
And then I suddenly spotted something in my my.cnf:
```
# Example MySQL config file for small systems.
# This is for a system with little memory (<= 64M) where MySQL is only used
# from time to time and it's important that the mysqld daemon
# doesn't use much resources.
```
What it was saying is that XAMPP ships with MySQL tuned for a very small machine, just enough to run a basic PHP and MySQL application, which is why my MySQL performed so badly in this case.
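If you're not sure what your server actually picked up from my.cnf, you can ask it directly:

```sql
-- Confirm which values mysqld is actually running with
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'query_cache%';
```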
Final attempt
I started an EC2 instance with 7GB of memory, installed a stock MySQL, and applied some of the performance-tuning variables. Now I got the results back in just under 20 seconds for a join of two tables with around 3 million rows.
Lesson learnt: do not use the MySQL bundled with XAMPP for data that is big, and write queries that return smaller results. For example, in my case, instead of filtering out the user_ids that don’t exist in the Users table, I filtered for the user_ids that do exist.
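Concretely, that’s the difference between something like these two queries (Events is a made-up name for my big table); the second one returned far fewer rows in my case:

```sql
-- Slow on my setup: find user_ids that do NOT exist in Users,
-- which forces MySQL to check every row for the absence of a match
SELECT e.user_id
FROM events e
LEFT JOIN Users u ON u.id = e.user_id
WHERE u.id IS NULL;

-- Faster, and a smaller result set: keep only the user_ids
-- that DO exist in Users
SELECT e.user_id
FROM events e
INNER JOIN Users u ON u.id = e.user_id;
```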
I also thought about using BigQuery from Google, but I guess my data is rather small and can hardly be called big data. Next time, I might try BigQuery just for fun.
Til next time,
noppanit