Project

Profile

Help

Bug #4795

Should we be following a redirection when we get HTTP response 301?

Added by Michael Kay 6 days ago. Updated about 8 hours ago.

Status:
Resolved
Priority:
Low
Category:
Internals
Sprint/Milestone:
-
Start date:
2020-10-16
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
10, 9.9, trunk
Fix Committed on Branch:
trunk
Fixed in Maintenance Release:

Description

This arises from test case qt3extra read-binary-001 which does

saxon:read-binary-resource("http://www.saxonica.com")

This is showing a 301 error because the HTTP request responds with status 301 saying that the content has permanently moved to the equivalent HTTPS URL.

No doubt also this happens on other calls where we retrieve resources using HTTP.

Should we be following the redirect, and returning the resource found at "https://www.saxonica.com"?

Encountered on 9.9, but no doubt applies on other branches also.

History

#1 Updated by Michael Kay 6 days ago

Note: this particular function is implemented in Extensions.java#567 with the logic:

            URL url = new URL(arg);
            InputStream stream;
            try {
                stream = url.openStream();
            } catch (IllegalArgumentException e) {
                throw new XPathException("saxon:read-binary-resource - invalid URL " + arg, SaxonErrorCode.SXJE0053);
            }

Similar logic appears, for example, in UnparsedTextResource.getContent().

#2 Updated by Norm Tovey-Walsh 6 days ago

I wondered about this myself recently. I think I fixed a few tests to use https: in order to work around the problem a couple of weeks ago.

Web browsers follow redirects by default, but curl doesn't, so I think there are conflicting expectations.

I think most users in most cases would expect redirects to be followed automatically. But if we do that, it'll be impossible to use Saxon to write a web application that detects redirects. I suppose a compromise would be to have a configuration setting to select the desired behavior, with the default being to follow redirects.

#3 Updated by Michael Kay 6 days ago

I'd be inclined to say that functions like doc() should follow redirects, and we should have a lower-level HTTP library for those who want more precise control.

#4 Updated by Norm Tovey-Walsh 6 days ago

Do we currently support the expath http: module? Maybe we should support that for lower-level access.

#5 Updated by Michael Kay 6 days ago

I think there's a third-party implementation of the HTTP module. But for Saxon-JS we decided to do something more 3.0-like using maps rather than nodes for the request and response structures.

#6 Updated by Norm Tovey-Walsh 3 days ago

I created a new class, net.sf.saxon.resource.Loader with a single static method, urlStream():

    /**
     * The maximum number of redirects to follow before throwing an IOException.
     */
    public static int MAX_REDIRECTS = 10;

    /**
     * Open a stream to retrieve the content identified by the URI. For HTTP URIs, this
     * method will follow up to MAX_REDIRECTS redirects or until it detects a loop.
     *
     * @param url The URL to retrieve.
     * @return An InputStream for the resource content.
     * @throws IOException If more than MAX_REDIRECTS are occur or if a loop is detected.
     */
    public static InputStream urlStream(@NotNull URL url) throws IOException  {
        if ("http".equals(url.getProtocol()) || "https".equals(url.getProtocol())) {
            HashSet<String> visited = new HashSet<>();
            String cookies = null;
            int count = MAX_REDIRECTS;
            for (;;) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                if (cookies != null) {
                    conn.setRequestProperty("Cookie", cookies);
                }
                int status = conn.getResponseCode();
                if (status == HttpURLConnection.HTTP_MOVED_PERM
                        || status == HttpURLConnection.HTTP_MOVED_TEMP) {
                    String location = conn.getHeaderField("Location");
                    url = new URL(location);
                    cookies = conn.getHeaderField("Set-Cookie");

                    if (visited.contains(location)) {
                        throw new IOException("HTTP redirect loop through " + location);
                    }
                    visited.add(location);

                    count -= 1;
                    if (count < 0) {
                        throw new IOException("HTTP redirects more than " + MAX_REDIRECTS + " times");
                    }
                } else {
                    return conn.getInputStream();
                }
            }
        } else {
            return url.openStream();
        }
    }

That's pretty much a drop-in replacement for the call in Extensions.java. But it doesn't help elsewhere because most of the doc() requests are routed through a resolver. I'm pretty sure the catalog resolvers will traverse redirects, so I expect this error doesn't arise when one of them is being used. It does arise when the StandardURIResolver is being used as a fallback, however.

I decided after some investigation that the right answer there is to change setSAXInputSource. At the moment, it special cases classpath: URIs and then uses the URI string for everything else. I changed that to attempt to use the new urlStream() method:

protected void setSAXInputSource(SAXSource source, String uriString) {
        if (uriString.startsWith("classpath:") && uriString.length() > 10) {
            InputStream is = getConfiguration().getDynamicLoader().getResourceAsStream(uriString.substring(10));
            if (is != null) {
                source.setInputSource(new InputSource(is));
                source.setSystemId(uriString);
                return;
            }
        }

        try {
            InputStream is = Loader.urlStream(new URL(uriString));
            source.setInputSource(new InputSource(is));
            source.setSystemId(uriString);
        } catch (IOException e) {
            source.setInputSource(new InputSource(uriString));
            source.setSystemId(uriString);
        }
    }

That fixes the doc() function and probably most other places where URIs are requested.

Looking around at where else url.openStream() is used, I found it in Configuration where it arises in the last ditch effort to find something with a class loader. I decided to leave that one alone. It also occurs in a couple of optional cpp classes where I did add it.

But as I was reviewing this code, I came to the conclusion that there's another error here. The saxon:read-binary-resource() method is currently getting the raw URI directly. I think it should be adjusted so that it uses the URI resolver.

#7 Updated by Michael Kay 3 days ago

That seems to handle it for XML resources, but I'm not sure about non-XML. Generally we avoid using a URIResolver for non-XML resources such as saxon:read-binary-resource() (and unparsed-text(), and the EXPath file library...) because it can return any Source and we wouldn't know what to do if, for example, it returned a DOMSource.

unparsed-text() uses

URLConnection connection = absoluteURL.openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
connection.connect();
inputStream = connection.getInputStream();

#8 Updated by Norm Tovey-Walsh 3 days ago

Maybe we just need a completely revamped set of resolver APIs. Unfortunately, that thread goes all the way back into the parser. Though I suppose a better API could also implement the less sophisticated APIs. More to think about.

In the meantime, I'll dig around for more uses of openConnection() as well.

#9 Updated by Norm Tovey-Walsh about 8 hours ago

  • Status changed from New to Resolved
  • Fix Committed on Branch trunk added

I've committed changes to trunk that cause most standard URI references to follow redirects. This covers doc() and related functions, XSLT and XQuery imports, and most other cases. The exceptions are cases that seemed to be specifically about the class loader and jar: scheme URIs where it presumably doesn't apply.

Please register to edit this issue

Also available in: Atom PDF