J.O.A.T.

Thursday, August 09, 2007

PHP Unable to Connect to MySQL

Over the past few weeks we've run into an issue on flora where people attempting to access any database-driven content are greeted with a MySQL connection error. The connection problem is intermittent and seems to run its course after a few hours. During that time I've been asking people to click the reload button in their browser; this usually gets them where they were headed. Luckily, we're not hosting any content for the general public or I wouldn't be able to rely on this solution.

The first time it happened I couldn't find anything in the logs that led me to a solution. Since the problem eventually went away, and because I had no way of really seeing what was going on at the time, I let it slide. However, about a week later another bout of connection errors started occurring. We'll be moving some high-use, general-public content onto the server in the near future, so a solution now needs to be found.

So far I've been unable to trace the source of the connection errors. There's nothing indicating a problem with Apache or MySQL (such as MySQL crashing). We are logging PHP errors and here's the relevant error message:
PHP Warning: mysql_connect() [function.mysql-connect]: Lost connection to MySQL server during query in %DBC% on line 8
%DBC% would be the connection script, and for this particular file line 8 is the mysql_connect() statement. Nothing too helpful here, except to note that we're not even making the connection.

One of the users had provided a printout of the error received on the front end. I don't recall there being anything helpful, but I tossed the printout after the problem righted itself. A mistake, since the printout included output from mysql_error(). Though I can't say for certain what kind of error was being reported, I don't believe I would have tossed the printout if it was truly useful. At any rate, I modified the connection script so that mysql_error() is recorded in the PHP error log. This won't help currently, but if/when connection errors are encountered in the future I'll have a little more information to help with debugging.

After the initial occurrence I spent some time looking for information on the web, but nothing I found showed any promise. Most of the references I found concerned too many connections, too much data, or similar issues. Those just don't seem to be the case here; I don't recall any of them being part of the error reported.

After the second occurrence I did some more searching and believe I am closer to a solution. If my recollection is correct about the nature of the error reported by mysql_error() then I suspect we may just be running into a connection timeout. The server has been running fairly sluggishly of late and it does seem possible that PHP may be unable to connect within the timeout limit.

MySQL is set to time out after 5 seconds. But in reviewing the logs and status information I don't believe this is the source of the problem. MySQL should be reporting any connection timeout on its end as aborted, but the number of aborted connections reported by MySQL (10) is significantly smaller than the number of connection errors reported by PHP (~200). With this in mind, the most likely timeout is occurring on the client. What this means is that MySQL is not responding at all within the 60-second timeout set by PHP.
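As a rough way of tallying the PHP-side failures, something like the following works against the error log. The sample lines and the /tmp path here are fabricated for illustration; the real log location depends on the error_log setting in php.ini:

```shell
# Fabricated excerpt standing in for the real PHP error log.
cat > /tmp/php_errors.sample <<'EOF'
PHP Warning: mysql_connect() [function.mysql-connect]: Lost connection to MySQL server during query in %DBC% on line 8
PHP Warning: mysql_connect() [function.mysql-connect]: Lost connection to MySQL server during query in %DBC% on line 8
PHP Notice: Undefined variable: foo in /var/www/page.php on line 3
EOF
# Count only the failed-connection warnings.
grep -c 'Lost connection to MySQL server' /tmp/php_errors.sample
# prints 2
```

The aborted-connection count on the MySQL side comes from SHOW STATUS, so the two numbers can be compared after each episode.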

The solution would be to increase the timeout on the client, server, or both. However, I don't want to make this change prematurely. I'm thinking of waiting until the problem emerges again in order to see what kind of errors are thrown by mysql_error().
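If the timeout theory holds up, the server-side half of the change would be a one-liner in my.cnf. A sketch with an illustrative value, not something I've applied yet:

```ini
[mysqld]
# Server-side connection handshake timeout, up from our current 5 seconds.
connect_timeout = 10
```

On the PHP side the equivalent knob is mysql.connect_timeout in php.ini (ours is effectively the 60 seconds noted above); if the change is made I'd likely bump both rather than just one.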

If I am unable to find a solution the next step might be to try posting a query on the MySQL Community and MySQL Performance Blog forums.

Alternative sources
I do have some thoughts on other issues that could be causing the connection problems. These, however, seem less likely. I am listing them here mainly as a reminder to myself in case the connection-timeout theory doesn't pan out.

Perhaps PHP isn't freeing connections and is running out of usable connections. The server is currently set to the default of 100 connections and we don't get nearly enough traffic to approach this limit. Plus, if PHP is behaving properly it should be closing the connection once it has finished processing a page. However, it is conceivable that something is preventing this from happening. If PHP isn't disconnecting when finished with a page, the connections could inch up towards the limit.
Another possibility is that MySQL is running out of back_log connections. The back_log connections are those that have connected to MySQL but have yet to be handed off to a child process. This is currently set at 50, which should be more than enough to handle our traffic load.
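For the record, both of these limits live in my.cnf. A sketch of where our current values (as described above) would sit, assuming they're set explicitly rather than left at defaults:

```ini
[mysqld]
# Maximum simultaneous client connections (the default of 100).
max_connections = 100
# Connections queued before being handed off to a handler.
back_log = 50
```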

Thursday, July 26, 2007

Firefox List Re-numbering Bug

Firefox has an odd error for lists that have the column-count style applied (-moz-column-count, since this is still experimental). When a multi-column ordered list contains links, the numbering of the last column can get corrupted when a link in that column is clicked.

Now, admittedly this is still an experimental feature, but it's probably a bug that needs to be addressed since this is the expected implementation of the spec as it currently stands.

I've created a test-case to demonstrate the bug.

I'm rolling the dice and submitting the bug to bugzilla. Hopefully I won't make a fool of myself again.

Friday, July 13, 2007

Optimizing MySQL

Slow on the draw
Some of our users were complaining of slow performance on one of our web-based utilities. There are, unfortunately, a number of factors that can affect system performance. I haven't spent much time learning the best practices for profiling a system to discover the source of bottlenecks, but my hunch was that the problem was in MySQL.

The users in question were having problems with slow INSERT/UPDATE statements. I used the status and variables tabs of PHPMyAdmin (PMA) to try and find any possible bottlenecks in MySQL. These tabs are equivalent to the SHOW STATUS and SHOW VARIABLES SQL statements respectively. The nice thing about PMA is that it will highlight any values it thinks may be causing a problem.

At first I thought locking might be the source of the slowdown, but after some testing I don't think that's the case. In my informal click test (clicking the save button on the form being used over and over) the worst delay I ran into was about five seconds. The delay time being experienced by the users was in the neighborhood of thirty seconds.

When I thought about the delay more I remembered a similar problem I've experienced myself when logging in to one of the utilities. The login could easily take up to 30 seconds at times. It took a while to nail down the possible culprit, but I believe it may be the MySQL table cache, which holds the tables MySQL has accessed in memory in order to speed up future access. After a bit more testing I've come to the conclusion that a significant slow-down may be occurring while MySQL removes one table from the cache and loads another one. By upping the value of the table_cache parameter in my.cnf, I believe the delay caused by this operation can be avoided.

I upped the table_cache from 64 to 80. I thought about doubling the value to have a bit of a safety net, but the drawback would be higher RAM usage. I'll have to watch the opened_tables status variable and see if the new cache value is sufficient.
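The change itself is a single line in my.cnf; the 80 is this round's value, and whether it sticks depends on what opened_tables does:

```ini
[mysqld]
# Open-table cache, raised from 64. A steadily climbing Opened_tables
# status value would mean this is still too small.
table_cache = 80
```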

I won't be able to see the effects until next week. The users say the problem seems to be worse in the mornings, which is probably when the table cache is being reshuffled the most based on usage, with tables going in and out of the cache.

Staying off the disk
Table locking isn't totally out of the question, though. With up to five people entering data at the same time it's possible that a user could get stuck at the end of a locking queue waiting for the others to finish. There's not a lot that can be done to prevent lock conflicts. We can raise the priority of SELECTs or decrease the priority of INSERTs/UPDATEs. Otherwise switching to InnoDB might show something of a performance increase (from the user's perspective). Since InnoDB uses row-level locking the overall time a user spends waiting on locks could decrease significantly. Of course, InnoDB isn't without its drawbacks in the form of higher overhead and the inability to perform full text searches. (Full text search is not a concern for the table involved in this instance.)

While I was modifying the operational parameters of MySQL I went ahead and updated the tmp_table_size option as well. The number of temporary tables being written to disk was relatively high when compared to the total number of temporary tables created (almost 50%). This can cause sluggish response time as well so I'm hoping that updating this value (32M to 48M) will also show an improvement in server responsiveness. I have a feeling I'll need to push this even higher, but again I need to keep an eye on overall RAM usage. All this optimization will prove pointless if the system has to start paging out to disk because the RAM is full.
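This one is also a single my.cnf line. One caveat I've read but haven't verified on our setup: the effective in-memory limit is the smaller of tmp_table_size and max_heap_table_size, so that second variable may need raising too:

```ini
[mysqld]
# In-memory temporary table ceiling, raised from 32M; temp tables that
# outgrow this get written to disk instead.
tmp_table_size = 48M
```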

Keeping it real (fast)
One issue to keep in mind with regard to server performance is indexes. A join should always be done on an indexed field; otherwise there can be serious lag as the system seeks out matching records. We appear to have a large number of join operations running that do not use indexes. Finding the offending SQL statements will require a bit of investigation, however, so I don't expect to address them quickly. Also, I don't believe this is a factor in the user-experienced sluggishness, so I'll probably address these as I come across them.


Tuesday, June 26, 2007

Partition Management Voodoo

We were given some additional hard drive space for our servers to help mitigate some space issues we were having. I wanted to keep the drive integration as simple as possible so I attempted to utilize the extra space to expand our current drives rather than adding new drives/mount points.

We have two different server environments, Windows 2003 and SUSE 10. Naturally the two systems require a different procedure. The end result, however, is pretty similar ... a drive mounted into the existing file system.

Steps on Windows 2003
Here are the steps I took when adding the new drive in Windows.
  1. initialized the drive as a dynamic disk
  2. formatted the drive using NTFS
  3. assigned a drive letter
The final step was to mount the drive into one of the data-heavy folders. This required a few steps:
  1. stop all programs/services that were accessing the folder in question
  2. rename the folder (I just use a .bak extension)
  3. create a new empty folder using the name of the folder you just renamed
  4. mount the drive into the folder
  5. copy all the files from the renamed folder into the newly mounted folder; if you need to copy permissions use something like scopy or xcopy
Steps on SUSE 10
Our linux server presented a bit more of a challenge. We already had a number of mount points and our new drive was larger than any of our current mount points. I wanted to format the drive as a single partition to use in our most data-heavy folder, as well as free up some currently used space for resizing the root and other space-limited partitions. In all this would require assigning some partitions to different mount points, copying data between partitions as their mount points changed, and even moving some partitions. Here are the general steps required:
  1. shut down any programs accessing the partition
  2. go to the partitioner in YaST (system->partitioner)
  3. assign the partition to a new mount point (such as changing /inet to /inet.bak; be sure to leave enough space for a temporary root partition if you need to expand the new root partition ... you can reclaim this space later)
  4. create a new partition on the new drive and assign it to the original mount point
  5. apply the changes and exit YaST
  6. unmount the partition as it will still occupy the old mount point; sudo umount /inet
  7. remount all the partitions; sudo mount -a
  8. duplicate the contents of /inet.bak into /inet; sudo cp --recursive --preserve=all /inet.bak/* /inet
  9. restart the system and test for points of failure (particularly with system services)
When you're creating mount points, don't forget to create an empty directory in which to mount the partition; the only exception is the root partition. Also, if you have partitions mounted as subdirectories you want to make sure you don't copy the contents of those partitions. The easiest way to do this is to create a new mount point for the partition you're copying; the sub-mounts will then be absent.
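The duplication step above is just a recursive, attribute-preserving copy. Here's a toy run with throwaway paths under /tmp standing in for the real /inet:

```shell
# Stand-ins for /inet.bak and /inet.
mkdir -p /tmp/inet.bak/app
echo "site data" > /tmp/inet.bak/app/index.html
mkdir -p /tmp/inet
# Recursive copy that preserves permissions, ownership, and timestamps.
cp --recursive --preserve=all /tmp/inet.bak/* /tmp/inet
ls /tmp/inet/app
# prints index.html
```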

One of the reasons for going through this process was to free up space on the drive that contained the root partition. It has been close to running out of space for a while now and expanding that partition is important. Though it would probably be best to leave the root partition as it is and create a new mount point for the one directory taking up the most space (/usr), we really don't have a lot of flexibility since the system has already been set up and I wanted to dedicate more space to the directory containing our web apps and databases.

Unfortunately, our root partition was created at the end of the drive, meaning it was impossible to expand it. So in order to expand the partition I had to move the root partition so that it was located in the empty space in the middle of the drive. Then I could expand it to use the empty space at the end of the drive.

The biggest problem with this plan is that there wasn't enough empty space on the drive to create a new partition large enough to hold the contents of the root partition. And in order to avoid serious problems I wanted to keep the current root partition available as a backup in case something went wrong while creating the new root. Also, since you can't resize an active partition, I had to use some space on one of the other partitions to create a temporary root partition. So here are the steps I took:
  1. create a subdirectory of our largest partition (/inet/rootdir), using the free space on the new partition, to hold a copy of the root partition
  2. mount the root partition to a new mount point for copying (/mnt/tmp)
  3. copy the content of /mnt/tmp to /inet/rootdir
  4. go to the partitioner in YaST
  5. assign the empty space on the drive containing the root partition to a new partition and mount it to a temporary location (/mnt/rootnew)
  6. apply your changes and exit YaST
  7. copy the content of /inet/rootdir to the new partition, excluding /usr since it contains too much data
  8. create a symbolic link on /mnt/rootnew that points usr to /inet/rootdir/usr; sudo ln -s /inet/rootdir/usr /mnt/rootnew/usr
  9. go to /mnt/rootnew/etc and edit fstab with sudo; change the / mount to /mnt/rootold; change the /mnt/rootnew mount to /
  10. go back into YaST and open the boot loader (System -> Boot Loader)
  11. create a copy of the primary boot entry; point the original boot entry to the new root partition
  12. reboot and cross your fingers
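The fstab edit in the list above amounts to swapping the mount points on two lines. A sketch, with device names that are placeholders rather than our actual ones (only /dev/hda6 is mentioned elsewhere in these notes):

```text
# /etc/fstab -- before (old root on /dev/hda2, new partition on /dev/hda6):
#   /dev/hda2   /              reiserfs  defaults  1 1
#   /dev/hda6   /mnt/rootnew   reiserfs  defaults  1 2
# after the swap:
/dev/hda2   /mnt/rootold   reiserfs  defaults  1 2
/dev/hda6   /              reiserfs  defaults  1 1
```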
After this I waited a week to ensure that all was well, then completed the process. I neglected to ensure that I had some space available for a temporary root partition, so I had to take a few extra steps. Otherwise the process is as follows:
  1. set up the temporary root partition and boot to it (make sure to edit fstab to point the / mount to the temporary root partition and to comment out the old root mount; also, might as well set the default init to 1 in inittab)
  2. from within YaST expand the root partition to occupy the available space on its drive and exit YaST
  3. check the newly expanded file system for errors; e.g. reiserfsck --check /dev/hda6
  4. mount /dev/hda6 (e.g. to /mnt/tmp)
  5. delete the symbolic link for /usr in /mnt/tmp
  6. copy /inet/rootdir/usr to /mnt/tmp/usr
  7. reboot
  8. enter single-user mode: sudo /sbin/init 1
  9. comment out the /inet partition in fstab
  10. unmount /inet
  11. from within YaST delete the temporary root partition and expand /inet to take over the rest of the drive
  12. run a file system check on the /inet partition
  13. reboot and perform some system checks to see if any errors are cropping up (e.g. hit the web server applications)
Post Notes
I found a few commands that helped immensely when trying to sort out directory sizes to determine how to handle moving the root partition.
  • df shows partition usage
  • du shows directory usage; du --max-depth=1 let me see the size of the directories immediately below my current location
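A toy run of the du form mentioned above, against a scratch tree (the paths are made up for illustration):

```shell
# Build a scratch tree with one subdirectory that's clearly bigger.
mkdir -p /tmp/sizedemo/big /tmp/sizedemo/small
dd if=/dev/zero of=/tmp/sizedemo/big/blob bs=1024 count=256 2>/dev/null
echo "tiny" > /tmp/sizedemo/small/note.txt
# Sizes of the directories immediately below /tmp/sizedemo, smallest first;
# the last line is the total for the tree.
du --max-depth=1 /tmp/sizedemo | sort -n
```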
Figuring out how to handle the root directory given the lack of drive space wasn't my only source of frustration. Not everything went well with the process, since I was still trying to figure out what to do and how I was going to do it as I went.

Since I wasn't too familiar with partitioning and mounting in linux I had to go through a bit of trial and error. My biggest mistake came when I was trying to set up the new root. I initially went ahead and tried to re-assign everything through the YaST partitioner, but the process errored out before finishing, resulting in a partial update to the system: the fstab file was only partially updated, and the boot loader was changed to reflect the new root partition. I'm pretty sure the reason I ran into problems is that I neglected to set up the target directories for the mounts.

Following the errors I made things worse: instead of finishing the work and making sure everything was set up correctly, I tried to clear up the issues by rebooting the system. And I rebooted before copying any files at all to the new root partition. After the reboot the boot loader was looking for system files on the new partition, and since they didn't exist the system wouldn't boot. Once I determined that the boot loader was pointing to the incorrect partition, I used grub's editing capabilities to point the boot command to the correct partition.

Upon getting the system up and running again I ran into a new problem: the filesystem was being mounted read-only, meaning I was unable to make any changes to the system. My guess is that there was a conflict caused by the old root partition and the new one being mounted at the same time. I was telling grub that the root partition was in one location, but the system was looking for it in another. The conflict led to the system setting the mount to read-only mode.

Eventually I was able to unlock the root partition by executing sudo mount -o remount,rw /. Once this was done I manually edited the fstab file to mount the partitions correctly and was able to get the system running again. At this point I essentially followed the process described to get the new partition running as the default.


Tuesday, May 22, 2007

Access Denied

I was troubleshooting some problems with one of our laptops when I ran across a particularly disconcerting cause: the administrator no longer had access to some registry entries. How this happened I can't even begin to imagine, but it doesn't appear to be an entirely uncommon occurrence. I found a nice, quick write-up on how to reset the permissions on the entire registry. A bit of a sledgehammer approach, but I think in this instance it was probably necessary.

So far things seem to be functioning normally again. Hopefully this will address the problems I've been seeing.

Monday, May 21, 2007

SUSE 10 System Time Drifting

The system time on our SUSE 10 box drifts rather horrendously. The main culprit appears to be the fact that it's running in a virtual environment. A quick search on Google indicates that this is not an uncommon problem; the source appears to be that SUSE 10 (linux kernel 2.6) uses an interrupt rate higher than the default host maximum (1000Hz vs. 100Hz).

See Clock in a Linux Guest Runs More Slowly or Quickly Than Real Time for more information.

I attempted to use the boot loader fixes suggested for the guest OS but did not have any luck. Even with the drift, the NTP daemon should be able to keep the time acceptably. Unfortunately it doesn't seem to work consistently. I suspect this may be because the drift leaves the system unable to verify the time. At any rate, I seem to have hit a wall in this regard.

Perhaps the best option for addressing this issue is in the configuration of the VMX. One of the options within the environment is to sync the clock with the host operating system (tools.syncTime = TRUE). This option requires that the vmware-tools be installed correctly. Since we don't have access to the VMX parameters we'll have to work with tech services to see if this is set. (And if so, why it isn't keeping the clock in sync.)
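For reference, the option in question would sit in the guest's .vmx file, roughly like this (VMware's convention quotes the value; we can't inspect our own file to confirm the surrounding settings):

```text
tools.syncTime = "TRUE"
```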

This last option is probably best because it negates the need for NTP on the guest OS, decreasing overall traffic to/from the server.

It looks like I'll be disabling NTP for the time being since it doesn't seem to be keeping time and it may possibly be crashing the server (too bad we can't just change the battery). And if we implement the preferred solution I won't need to re-enable it.


Monday, May 14, 2007

Dreamweaver Find/Replace Funkiness

When using Dreamweaver 8.0.2 to do a find/replace, the following message may occasionally be encountered:
The following files changed after the initial search was performed:
[current document]
It appears to be a minor bug in DW. If you restart the program the same exact find/replace will work fine.
