MavEtJu's Distorted View of the World - 2007-06
Slony Replication and Inherited Objects
Posted on 2007-06-30 16:00:00

Our PostgreSQL servers, called Janus and Kermit, use Slony replication to make sure we have a backup copy of the data. If we need to do work on Janus, we fail over to the databases on Kermit and everything is fine for the users.

One of our databases uses inherited objects: tables in that database share a number of standard fields and add a lot of table-specific fields. Recently we added a field to the parent object. Replication went fine; Slony can handle this.

One of the issues we have with that database is that we can't "vacuum full" it, which means that it will grow and grow and grow. At a certain moment the partition becomes full and we need to fail over to the other server, drop the database and replicate it back. This happens about once every two months, it takes five minutes and everything is fine again.

Except for today...
This is the parent object in the master database:

    mail=# \d barnet_objects
                     Table "public.barnet_objects"
          Column       |  Type   |                Modifiers
    -------------------+---------+------------------------------------------
     id                | bigint  | not null default nextval('barnet_object_i
     object_type       | integer | default 0
     owner             | bigint  | default (0)::bigint
     creator           | bigint  | default (0)::bigint
     acl               | bigint  | default (0)::bigint
     notforpublication | boolean | default false

The experienced eye can see that the field notforpublication is the new field in the barnet_object.

And this is an inherited object:

    mail=# \d chambers
                        Table "public.chambers"
          Column       |  Type   |                Modifiers
    -------------------+---------+------------------------------------------
     id                | bigint  | not null default nextval('barnet_object_i
     object_type       | integer | default 6
     owner             | bigint  | default (0)::bigint
     creator           | bigint  | default (0)::bigint
     acl               | bigint  | default (0)::bigint
     name              | text    |
     description       | text    |
     network_code      | text    |
    [...]
     ldap_address      | text    |
     notforpublication | boolean | default false

pg_dump gives the following command to create the chambers table:

    CREATE TABLE chambers (
        object_type integer DEFAULT 6,
        name text,
        description text,
        network_code text,
        [...]
        ldap_address text
    ) INHERITS (barnet_objects);

And it creates this table in the database:

                        Table "public.chambers"
          Column       |  Type   |                Modifiers
    -------------------+---------+------------------------------------------
     id                | bigint  | not null default nextval('barnet_object_i
     object_type       | integer | default 6
     owner             | bigint  | default (0)::bigint
     creator           | bigint  | default (0)::bigint
     acl               | bigint  | default (0)::bigint
     notforpublication | boolean | default false
     name              | text    |
     description       | text    |
     network_code      | text    |
    [...]
     ldap_address      | text    |

As you can see, the order is different: in the recreated table the inherited notforpublication field comes before the table-specific fields instead of after them. And Slony replicates based on the order of fields and now complains about...

    DEBUG3 remoteWorkerThread_4: table "public"."chambers" does not require Slony-I serial key
    DEBUG4 remoteWorkerThread_4: Begin COPY of table "public"."chambers"
    ERROR remoteWorkerThread_4: copy from stdin on local node - PGRES_FATAL_ERROR ERROR: invalid input syntax for type boolean: "BARNET"
    CONTEXT: COPY chambers, line 1, column notforpublication: "BARNET"
    WARN remoteWorkerThread_4: data copy for set 1 failed - sleep 60 seconds

The sixth value in the row, the name "BARNET", ends up in the sixth column of the recreated table, which is now the boolean notforpublication.

So instead of a five minute outage this afternoon, we'll have a three-four-five-six hour outage during the night, in which I have to pg_dump | psql the data to the now-slave-database and start replication on that one back to the now-master:

    COPY chambers (id, object_type, "owner", creator, acl, name, description, network_code, pop_server, smtp_server, mail_domain, proxy_server, imap_address, ldap_address, notforpublication) FROM stdin;
    10818 6 1 961 0 BARNET BarNet Internal N022 pop.barnet.com.au smtp.barnet.com.au barnet.com.au proxy.barnet.com.au imap.barnet.com.au ldap.barnet.com.au f

Microsoft Windows TCP/IP Stack Behaviour
Posted on 2007-06-21 09:00:00, modified on 2007-06-21 14:00:00

Recently I had to redo the design of the machine with our public websites, and after an earlier successful implementation of virtualisation with FreeBSD jails, I decided to put them all in their own private jail, with their own public IP address, too. Since I'm a firm believer in "eat your own stuff" and my website was on the list of sites to be moved, I decided to do that one first. The IP range we have for it was 202.83.176.0/24, and since the first half of it was already in use by other services, I started to go down from 255.

To make life easier for us, we use a lot of dynamic routing in our network. Also with jails: They're defined on the loopback interfaces and the subnet masks are all /32's.
The combination of these two should make it easy to move them around if necessary without having to worry about physical machines and subnets and DNS.

So, we have this new webserver (my webserver, so somehow important to me) on 202.83.176.255 and it seems to work fine. I can access it from inside the network, I can access it from outside the network, I see web browsers and spiders connecting to it. Life is good!

Except... I get reports from people saying that they can't get to my website, that there is some kind of DNS error: "Cannot find server or DNS error" is what Internet Explorer tells them. I ask them: "Can you ping the machine?" "No, that's not working." "Can you telnet to it?" "No, it says Connect failed." I don't see anything in the logs, I don't see anything on the network. No idea what goes wrong here...

Finally I get the same message from friends who have elite skillz in the ancient arts of ping, traceroute, telnet and tcpdump (Hi dvl, koitsu!). And we start trying:

- One says: yes, we can ping 202.83.176.255, so there is nothing wrong on the end-to-end network layer.
- The other says: no, we can't ping 202.83.176.255, but I saw their ICMP packets on the webserver.
- From inside the jail, I can connect to their hosts, so there is nothing wrong with TCP sessions.
- We advertise a /21 to the world, so it won't be a network boundary problem.
- One of them can connect to the webserver (he's running FreeBSD), and one of them cannot (he's running Windows). I see the packets of the first, but not the packets of the second (whose ICMP packets I did see).

Then the one with FreeBSD tries it with his Windows machine and suddenly he can't connect anymore. I think we narrowed the problem down to one thing: Microsoft Windows (Ouch, it did it again). We do more tests:

- On the Windows machine, we cannot ping 202.83.176.255 (but I see the ICMP packets).
- We cannot set up a TCP session to it (and I don't see any TCP packets).
- We can ping 202.83.176.254, and we can set up a TCP session to it.

Now put one and one together....
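One way to see what sets 202.83.176.255 apart from 202.83.176.254 is a small sketch with Python's ipaddress module. It only uses the addresses from the tests and the /21 mentioned above, which is assumed here to be the /21 that contains the webserver address. The helper names classful_network and is_classful_broadcast are made up for this illustration, and the sketch says nothing about how Windows implements its check internally.

```python
import ipaddress


def classful_network(addr: str) -> ipaddress.IPv4Network:
    """Return the historical class A/B/C network that contains addr."""
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    if first_octet < 128:      # class A
        prefix = 8
    elif first_octet < 192:    # class B
        prefix = 16
    elif first_octet < 224:    # class C
        prefix = 24
    else:
        raise ValueError("class D/E addresses have no classful network")
    # strict=False lets us pass a host address and get its network back.
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)


def is_classful_broadcast(addr: str) -> bool:
    """True if addr is the broadcast address of its classful network."""
    return ipaddress.IPv4Address(addr) == classful_network(addr).broadcast_address


# Assumption: the advertised /21 is the one containing the webserver address.
advertised = ipaddress.ip_network("202.83.176.255/21", strict=False)

for host in ("202.83.176.254", "202.83.176.255"):
    ip = ipaddress.IPv4Address(host)
    print(
        host,
        "classful net:", classful_network(host),
        "classful broadcast:", is_classful_broadcast(host),
        "broadcast of", advertised, ":", ip == advertised.broadcast_address,
    )
```

Only 202.83.176.255 is flagged as a classful broadcast address. Inside the advertised /21, and certainly with a /32 mask on the loopback interface, it is an ordinary host address.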
Historically, 202.83.176.255 is in a class C subnet, going from 202.83.176.0 to 202.83.176.255, which makes it the broadcast address of that subnet. These days, with Classless Inter-Domain Routing, that subnet can be split into many smaller subnets, or be part of a supernet. Somehow, Windows still thinks in classful subnets (you can see it in the default subnet mask it suggests when you configure an IP address on a network interface). And it blocks TCP traffic to that IP address halfway down the IP stack. To test this, we tried the following on the Windows machines:
But still:
Anyway, the webserver now runs on 202.83.176.248 and Windows machines are happy again. See also the thread at DSLReports.com.

Update: The problem is confirmed in Windows 2000, Windows 2003 and Windows XP. Vista handles the ICMP and TCP packets as expected.

On computers shutting down...
Posted on 2007-06-20 20:00:00, modified on 2007-06-20 10:00:00

A short time ago my computer started shutting down at random times. No reboot, no kernel panic, just a full power-shutdown. The first time it was when I was recompiling the new Xorg 7.2 distribution, and that left me without a desktop for two days. Pressing the power button once would not bring back the computer, it needed to be done a couple of times. Later on, when Xorg was running again, it happened when compiling a new version of GCC. I suspected it was caused by some strange GCC issue (don't trust computers, but don't trust compilers either!). And it happened when re-encoding media streams. All very CPU-intensive tasks.

I mentioned the issue a couple of times, and people suggested I look at the CPU temperature: It might be the motherboard trying to protect itself against overheating. sysutils/mbmon told me that the fan was running at 2400 rpm and the temperature was between 75 and 85 degrees Celsius.

75 and 85 degrees Celsius!???! That's a little bit much. But the PC Health screen in the BIOS showed the same. And it showed that it would shut down the computer at 90 degrees Celsius. Aha. That's one and one.

People suggested checking if the fan was still working (I could hear it), or replacing the thermal paste between the cooling-metal and the CPU. But first, they urged, see if the fan is dirty.

Ouch... That was a dead giveaway. The whole cooling-metal below the fan was gray-brown with a cover of dust. Trying to blow that away would not be possible without causing a local fog cloud!
After cleaning the cooling-metal, and the fan, and the power supply, and the video card, and the rest, the CPU was back at temperatures between 35 and 45 degrees Celsius and the fans were running at 1700 rpm.

Note: According to my wife, there was no danger of getting the CPU fried. That's because frying happens at 200 degrees Celsius, while this wasn't even close to boiling (She's a chef :-)

Routers routers routers routers... and troubles
Posted on 2007-06-13 22:00:00, modified on 2007-07-05 09:00:00

Due to a recent change in network infrastructure, we needed to move the place where the IP NATting is done. Logically speaking it's now done in the middle of the network, so that the traffic from the users (on the left) passes through it, but the traffic from the servers (on the right) does not, and that sometimes gives problems.

In the right-hand side of the network we have an Extreme Networks BlackDiamond 8806 (to the servers) and an Extreme Networks X450 (to the BD8806 and the Internet), and the idea was that all traffic going through it with RFC 1918 IP addresses as source would be redirected through a NAT gateway. Great design, should work without problems!

    entry redirect_to_nat {
        if match all {
            source-address 10.0.0.0/8 ;
        } then {
            redirect 10.252.13.38 ;
        }
    }

The manual says "You can use the statement configure access-list <aclname> <ingress | egress>". The BD8806 says "I only know ingress filtering". Yes, that is right. So instead of egress filtering on one port, we need to do ingress filtering on 47 ports. Not my idea of a good time. To do egress filtering you need a BD10808 or a 12804, not a BD8806....

No problem, then we do ingress filtering on the X450 connected to it! That one supports ingress filtering, but... It doesn't support the "redirect" command: You need an X450a or an X450e for that.

Curse, swear, stamp-with-feet-on-the-ground-while-screaming.

To make a short project long...
We have ordered another X450a...

Update: The new X450a came without a core license, which means that it won't support BGP. We had two leftover core license vouchers from earlier X450's which were never used. And today we found out that, despite being bought for the same price and offering the same functionality, they can't be used on an X450a.

More updates: Because we couldn't install a real license, we installed a trial license. Trial licenses give you the Core license functionality, are valid for 30 days and give you the opportunity to install your hardware without having to wait two days before the real license arrives. You can upgrade without rebooting from no license to trial license, and from no license to real license. To upgrade from trial license to real license you need to use the command "clear license-info", which tells you to reboot, but after the reboot the trial license is still there. It wasn't until we got into debug mode of the switch (which you can only do with the help of Extreme Networks TAC), entered the command "debug epm clear trial-license", and rebooted the switch that the trial license was finally gone.

So much for a nice hierarchy...
Posted on 2007-06-13 17:00:00

Over the past years, I've created a nice hierarchy in DNS to keep my insanity under control. For example, for the POP server we have pop.barnet, which points with a CNAME to pop2.barnet, which points with a CNAME to the dbmail2.barnet jail, which points with an A record to the IP address of the machine:

    pop     60 IN CNAME pop2
    pop2       IN CNAME dbmail2
    dbmail2    IN A     202.83.178.99

So if the machine fails, or the dbmail jail doesn't work anymore, or the dbmail-pop3 program is broken, all we have to do is make one little change in the hierarchy and it is all working again, without disrupting the real operation of the machine.

Since earlier this month we don't have one, but two POP servers!
And of course the easiest solution would be: Let pop.barnet be a CNAME to both pop1.barnet and pop2.barnet.

    pop     60 IN CNAME pop1
    pop     60 IN CNAME pop2
    pop1       IN CNAME dbmail1
    pop2       IN CNAME dbmail2
    dbmail1    IN A     202.83.178.88
    dbmail2    IN A     202.83.178.99

And there starts the trouble:

    Jun 13 16:17:24 ns0 named[3106]: dns_master_load: .db/barnet.com.au:203: pop.barnet.com.au: multiple RRs of singleton type

Well, I'm (!)@*#()!@*#'d. This is not allowed... Now I have, because it can't be done any differently, reintroduced A records for the services....

    pop     60 IN A     202.83.178.88   ; pop1
    pop     60 IN A     202.83.178.99   ; pop2
    pop1       IN CNAME dbmail1
    pop2       IN CNAME dbmail2
    dbmail1    IN A     202.83.178.88
    dbmail2    IN A     202.83.178.99
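The rule that named enforces here can be sketched in a few lines of Python. This is a simplified illustration and not BIND's actual zone loader: the record tuples mirror the zone fragments above, and the names BROKEN, FIXED, find_cname_conflicts and resolve are made up for this example. It assumes the usual DNS rule that a name may own at most one CNAME record (what named calls a singleton type), while several A records on one name are fine.

```python
from collections import defaultdict

# Records as (owner, type, data), mirroring the zone fragments above.
BROKEN = [
    ("pop", "CNAME", "pop1"),
    ("pop", "CNAME", "pop2"),
    ("pop1", "CNAME", "dbmail1"),
    ("pop2", "CNAME", "dbmail2"),
    ("dbmail1", "A", "202.83.178.88"),
    ("dbmail2", "A", "202.83.178.99"),
]

FIXED = [
    ("pop", "A", "202.83.178.88"),
    ("pop", "A", "202.83.178.99"),
    ("pop1", "CNAME", "dbmail1"),
    ("pop2", "CNAME", "dbmail2"),
    ("dbmail1", "A", "202.83.178.88"),
    ("dbmail2", "A", "202.83.178.99"),
]


def find_cname_conflicts(records):
    """Return the owner names that carry more than one CNAME record."""
    counts = defaultdict(int)
    for owner, rtype, _data in records:
        if rtype == "CNAME":
            counts[owner] += 1
    return sorted(owner for owner, n in counts.items() if n > 1)


def resolve(records, name, max_depth=8):
    """Follow CNAMEs starting at name and return the sorted A record data."""
    for _ in range(max_depth):
        addresses = sorted(d for o, t, d in records if o == name and t == "A")
        if addresses:
            return addresses
        targets = [d for o, t, d in records if o == name and t == "CNAME"]
        if not targets:
            return []
        name = targets[0]  # a valid zone has exactly one target here
    raise RuntimeError("CNAME chain too long")


print(find_cname_conflicts(BROKEN))  # owners a strict loader would reject
print(find_cname_conflicts(FIXED))
print(resolve(FIXED, "pop"))
print(resolve(FIXED, "pop2"))
```

With the A records on pop, the pop1 and pop2 names keep their place in the hierarchy, but moving a service now means editing the addresses on pop as well as the CNAME below it.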