
Trouble with the clock

You know what they say: time is money. But for us engineers, time also needs to be correct. When it isn’t, you lose a lot of it, and thus money. Trouble with time can even pop up in unexpected places: your lab.

The lab is a playground for software engineers, the place where they get things done. Corporate IT views it as a necessary evil and blocks it off from the rest of the world. IT will generally give little to no support for machines in the lab, so the engineers are left maintaining the machines themselves. Although engineers are creative, they are not the best operators.

The latest thing I came across in our lab was trouble with the clock. Over the last few months I noticed that, during rapid build/test cycles, the latest build was not always picked up by the machines we were running the tests on. In general it was OK, but when you needed it the most it sometimes failed to pick up the most recent build.

We have a fairly complex setup with different servers and agents. Our Bamboo CI server pushes our artifacts to the repository. On our build agents we let Maven poll the repository for the latest artifacts. The Maven parent POMs on our build agents are slightly different from the POMs that the developers have on their machines. The agent POMs allow SNAPSHOT builds, so we can move our unreleased artifacts to our agents for further processing and testing.

Over the past months the problem got worse, so it was time to start investigating. The cause was quickly found: clock drift. Because our agents were in an IP range that had no access to the internet, the default time sync servers were not reachable. In about half a year the agent clocks had drifted about 30 minutes ahead. So if we ran another build/test cycle within that 30-minute window, we were testing an older build. Let’s go over what happened:

Cycle 1:

  • Bamboo Server (14:30) -> Repo (14:30)
  • Agent (14:35 + 30m drift) asks for new artifact, my last is from 9:12?
  • Repo -> I got a newer file (14:30)
  • Agent downloads, and saves to disk (14:35 + 30m drift = 15:05)

Cycle 2:

  • Bamboo Server (14:45) -> Repo (14:45)
  • Agent (14:50 + 30m drift) asks for new artifact, my last is from 15:05?
  • Repo -> My file is older, it’s from 14:45
  • Agent keeps using the previous, older version
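The two cycles can be replayed with a small script. This is only a simplified model of the "is the repository file newer than my local copy?" check described above, not Maven's actual implementation; the function names `to_minutes` and `needs_update` are made up for this illustration.

```shell
#!/usr/bin/env bash
# Simplified, hypothetical model of the agent's freshness check:
# the repository artifact is downloaded only when its timestamp is
# newer than the timestamp of the local copy.

# Convert HH:MM to minutes since midnight (10# forces base 10, so "09" works).
to_minutes() {
  local h=${1%%:*} m=${1##*:}
  echo $(( 10#$h * 60 + 10#$m ))
}

# needs_update REPO_TIME LOCAL_TIME -> prints "download" or "keep"
needs_update() {
  if [ "$(to_minutes "$1")" -gt "$(to_minutes "$2")" ]; then
    echo "download"
  else
    echo "keep"
  fi
}

# Cycle 1: repo artifact from 14:30, local copy from 9:12
needs_update 14:30 9:12    # prints "download"

# Cycle 2: repo artifact from 14:45, but the local copy was stamped
# 15:05 by the agent's drifting clock
needs_update 14:45 15:05   # prints "keep"
```

In the second call the newer build loses the comparison purely because the agent stamped its local copy with a time that lies in the repository's future.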

The biggest problem was that this went unnoticed for a long time, because it only occurred when a new cycle was started on the same agent (we have multiple agents) within the window created by the clock drift.

The solution is simple though: pick one of your servers that has access to the standard time servers and use it as a relay for the sync requests from the lab machines.

# if not yet installed
sudo yum -y install ntp
# sync the time once from your internal NTP server
sudo /usr/sbin/ntpdate -v 192.168.42.42
# edit the ntp config and start/restart the NTP daemon
sudo vim /etc/ntp.conf # set server to 192.168.42.42
sudo /etc/init.d/ntpd start
# make sure ntpd is started again after a reboot
sudo /sbin/chkconfig ntpd on
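If you have more than a handful of agents, editing the config by hand in vim gets old quickly. Below is a minimal sketch of a non-interactive alternative. The helper name `set_ntp_server` is made up for this example, and it assumes the stock config lists its time sources on lines starting with `server `; check your own `/etc/ntp.conf` before trusting it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point an ntp.conf-style file at one internal server.
# Existing "server ..." lines are commented out, then our server is appended.
# Intended to be run once per machine, as root for the real config file.
set_ntp_server() {
  local conf=$1 server=$2 tmp
  tmp=$(mktemp)
  # Comment out the time sources that are currently configured
  sed 's/^server /#server /' "$conf" > "$tmp"
  # Add the internal server; iburst speeds up the initial synchronization
  echo "server $server iburst" >> "$tmp"
  # Write back in place so the original file keeps its owner and permissions
  cat "$tmp" > "$conf"
  rm -f "$tmp"
}

# Example (as root): set_ntp_server /etc/ntp.conf 192.168.42.42
```

After that, start or restart ntpd as shown above.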

Conclusion

Making sure that the clocks of your machines are in sync is not only a matter for production servers; it is also important in your lab. It would be easier if IT didn’t disconnect the lab from the rest of the world, but that is just reality. We have now added a new item to our agent build checklist: make sure each machine syncs automatically with a time server it can actually reach. It could also be an interesting feature for our Atlassian Bamboo CI server: send a notification to the admins when the agents have clock drift.
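Until Bamboo grows such a feature, a cron job on each agent can approximate it. The sketch below is an assumption-heavy illustration, not something we run in this exact form: the function names are invented, the 60-second threshold is arbitrary, and the parser assumes that `ntpdate -q` prints a line of the form `server <ip>, stratum <n>, offset <seconds>, delay <seconds>`, so verify the output on your own machines first.

```shell
#!/usr/bin/env bash
# Hypothetical drift check for an agent.

# parse_offset: read "ntpdate -q" style output on stdin and print the
# offset (in seconds) found on the first line that starts with "server".
parse_offset() {
  awk '/^server/ {
    for (i = 1; i < NF; i++) {
      if ($i == "offset") {
        v = $(i + 1)
        sub(/,$/, "", v)   # strip the trailing comma
        print v
        exit
      }
    }
  }'
}

# drift_exceeds OFFSET_SECONDS THRESHOLD_SECONDS
# Exit status 0 when the absolute offset is larger than the threshold.
drift_exceeds() {
  awk -v o="$1" -v t="$2" 'BEGIN {
    o += 0; t += 0         # force numeric comparison
    if (o < 0) o = -o
    exit (o > t) ? 0 : 1
  }'
}

# Example for a cron job (-q only queries, it does not set the clock):
#   offset=$(/usr/sbin/ntpdate -q 192.168.42.42 | parse_offset)
#   drift_exceeds "$offset" 60 && echo "clock drift on $(hostname): ${offset}s"
```

With cron's mail delivery configured, the `echo` is enough to get the warning to the admins.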
