[App_rpt-users] Reliability / Network Stability

Bryan Fields Bryan at bryanfields.net
Tue Jun 27 15:57:39 UTC 2017


On 6/26/17 9:10 PM, Benjamin Naber wrote:
> Over the course of the last six months or so, I have noticed there have
> been some issues with allstarlink.
> 
> Either the allstarlink website doesn't work, or connections are randomly
> dropped on known high-reliability networks and connected equipment.
> All without apparent cause.

When and where?

Allstarlink.org is and has been online and stable for some time.

Docs.allstarlink.org had a network outage recently due to a dead switch.  It
was rectified about 5-6 hours later by our network vendor.

> Again today, for no apparent reason, all links on several systems in
> this area were dropped, and were not able to connect to anyone. Some of
> our nodes have "direct access" to other nodes specified in the rpt.conf,
> and those connections worked fine.

Again, when and where?  Connections from node to node are direct; the only
thing ASL does is build a database and push that to the nodes every 10 minutes
or so.

If your nodes are listed as online, but they cannot talk, there is a network
issue unrelated to ASL.


> When a node cannot connect to node 2000, or some other random one, there
> is an issue.

This sort of "error" report is lacking.  You would need to give the errors,
dates/times, source node IP and AS path if you can provide that.
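
As a rough illustration of what a usable report could carry, here is a minimal
Python sketch that collects those fields into one block of text.  The class
name, field names, local node number, IP address and AS numbers are all
hypothetical placeholders (documentation ranges), not anything ASL defines.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LinkFailureReport:
    """One failed connection attempt, with the details asked for above."""
    local_node: str          # node that tried to connect
    remote_node: str         # node it could not reach
    source_ip: str           # public IP of the local node
    error_text: str          # error exactly as logged, not paraphrased
    when: datetime           # time of the failure (timezone-aware)
    as_path: List[str] = field(default_factory=list)  # optional, if known

    def render(self) -> str:
        # Normalize to UTC so reports from different sites line up.
        stamp = self.when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        path = " ".join(self.as_path) if self.as_path else "(not available)"
        return "\n".join([
            f"Time (UTC):  {stamp}",
            f"Local node:  {self.local_node}",
            f"Remote node: {self.remote_node}",
            f"Source IP:   {self.source_ip}",
            f"AS path:     {path}",
            f"Error:       {self.error_text}",
        ])


if __name__ == "__main__":
    report = LinkFailureReport(
        local_node="12345",              # hypothetical node number
        remote_node="2000",
        source_ip="192.0.2.10",          # documentation address
        error_text="(paste the exact log line here)",
        when=datetime.now(timezone.utc),
        as_path=["64496", "64511"],      # documentation AS numbers
    )
    print(report.render())
```

One report per failed attempt, posted to the list, gives us something we can
line up against our own logs.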

> All nodes in this area have different ISPs, so it rules out the
> possibility of ISP issue.

No, it makes it less likely, but it in no way rules it out.  Are you doing NAT
or is each node on its own IP?
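
If you're not sure about the NAT question, a quick check is whether the
address configured on the node itself falls in a private or carrier-grade NAT
range.  The sketch below uses Python's standard ipaddress module; the function
name is made up for illustration, and it only looks at the address you pass
in, so it says nothing about port forwarding or what your router does.

```python
import ipaddress

# RFC 1918 private ranges plus the RFC 6598 carrier-grade NAT range.
NAT_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("100.64.0.0/10"),
]


def likely_behind_nat(local_ip: str) -> bool:
    """Return True if the node's own address is in a range that is normally
    translated before it reaches the Internet."""
    addr = ipaddress.ip_address(local_ip)
    return any(addr in net for net in NAT_RANGES)


if __name__ == "__main__":
    for ip in ("192.168.1.20", "100.64.0.5", "8.8.8.8"):
        print(ip, "-> NAT likely" if likely_behind_nat(ip) else "-> own public IP")
```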

> Other folks I have talked to across the country have stated they have
> noticed similar issues.

This is nothing but scuttlebutt without evidence.

> Has anyone noticed this, and not said anything, or what is going on?

In the last 6 months ASL has lost Jim Dixon, formally incorporated as a
non-profit organization and been forced to document a number of things which
Jim had in his head.

"The death of God left the angels in a strange position."

We had to identify the ASL infrastructure, which was spread out over a number
of different locations.  We've done this and have access to everything, and
backups in case anything fails.  We've assembled an infrastructure team and
have an architecture we're building into (Docker).  Over the next few months
we're going to move servers one by one into this environment.

The mailing lists have been moved to a high-performance server with real spam
filtering.  Nagios is watching everything and we know within 5-10 minutes when
there is an outage of a service.

This is a huge undertaking.

Tim's working on a new website, and Steve's been running everything else
including development.  Oh, and all the source code is on GitHub now too.

If there are network issues we _want_ to know about them, but it must be in a
detailed manner.  I've responded to people on Reddit and said to post details
over on this list.  Without detailed logs and reports we cannot do anything to
confirm them.

73's
-- 
Bryan Fields

727-409-1194 - Voice
http://bryanfields.net


