Java Large Files – Efficient Processing

A guide to the optimal ways of processing large files in Java while avoiding OutOfMemoryError. We compare the fastest and the most memory-efficient ways of reading and writing files.

Overview

This tutorial discusses ways to process large files in Java and how to avoid OutOfMemoryError while transferring or processing them. Java File IO and Java NIO provide various ways of dealing with files. However, handling large files is challenging because we must find the right balance between speed and memory utilization.

This article will use different ways to read a massive file from one place and copy it to another. While doing so, we will monitor the time it takes and the memory it consumes. Finally, we will discuss their performance and find the most efficient method of processing large files in Java.

We will write examples of transferring large files by using Java Streams, Java Scanner, Java FileChannel, and Java BufferedInputStream. To begin with, however, we will discuss the fastest way of transferring a file.

Fastest Way of Java Large File Processing

This section covers the fastest way of reading and writing large files in Java. However, a quicker way doesn’t mean a better way, and we will discuss that soon.

When we use Java IO to read or write a file, the slowest part of the process is transferring the file contents between the hard disk and the JVM memory. Thus, to make file IO faster, we can reduce the number of data transfers, and the easiest way of doing that is to transfer everything in one go.

For example, using Files.readAllBytes()

byte[] bytes = Files.readAllBytes(sourcePath);

Or using Files.readAllLines().

List<String> lines = Files.readAllLines(sourcePath);

In the first snippet, the entire content of the file is copied into a byte array, which is held in memory. Similarly, in the second snippet, the entire content of a text file is read as a List of strings and stored in memory too.

The following method reads a byte[] from a source file and writes those bytes to the target file.

private void copyByUsingByteArray() throws IOException {
  Path sourcePath = Path.of(source);
  Path targetPath = Path.of(target);

  byte[] bytes = Files.readAllBytes(sourcePath);
  Files.write(targetPath, bytes, StandardOpenOption.CREATE);
}

Using this method, we will read a 667 MB file from the source and write it to the target. We run the copy in a separate thread to observe the memory footprint, and while the copy happens in that thread, the parent thread prints the amount of memory used (in MB) at fixed intervals.
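The measurement harness itself is not shown in this article, so here is a minimal sketch of how it might look. The class name, the source path, the sampling interval, and the "used = total - free" memory calculation are illustrative assumptions, not the exact benchmark code.

import java.nio.file.Files;
import java.nio.file.Path;

public class CopyBenchmark {

  private static final long MB = 1024 * 1024;

  public static void main(String[] args) throws Exception {
    Path sourcePath = Path.of("/tmp/source.dat"); // hypothetical source path
    System.out.println("Source File Size " + Files.size(sourcePath) / MB);

    long start = System.currentTimeMillis();

    // Run the copy in a worker thread so the parent can sample memory
    Thread worker = new Thread(() -> {
      // invoke the copy method under test here, e.g. copyByUsingByteArray()
    });
    worker.start();

    // While the copy runs, print the used memory at a fixed interval
    Runtime runtime = Runtime.getRuntime();
    while (worker.isAlive()) {
      long usedMb = (runtime.totalMemory() - runtime.freeMemory()) / MB;
      System.out.println("Memory used: " + usedMb);
      Thread.sleep(1000);
    }
    System.out.println("total time " + (System.currentTimeMillis() - start));
  }
}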

Source File Size 667
Memory used: 9
Memory used: 676
Memory used: 676
total time 1803

The transfer finished fast; however, it consumed a lot of memory. This solution is impractical when copying large files or processing multiple such files simultaneously.

Using BufferedReader and Java Streams

Now, we will test the performance of the Java Streams to process a huge file. To do that, we will use BufferedReader, which provides a Stream of strings read from the file.

Next is an example of using the Java Stream provided by BufferedReader to process a huge (10GB) file.

private void copyUsingJavaStreams() throws IOException {
  try (
      InputStream inputStream = new FileInputStream(source);
      BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));

      FileWriter fileWriter = new FileWriter(target, true);
      PrintWriter printWriter = new PrintWriter(new BufferedWriter(fileWriter));
      Stream<String> linesStream = bufferedReader.lines();
  ) {
    linesStream.forEach(printWriter::println);
  }
}

Now, we will test the method that uses BufferedReader to read a 10GB file.

 Source File Size 10471
 Memory used: 9
 Memory used: 112
 Memory used: 71
 Memory used: 17
 Memory used: 124
 Memory used: 76
 Memory used: 28
 Memory used: 69
 Memory used: 35
 Memory used: 47
 total time 42025

Java Streams are lazy, which is why they are so memory efficient: while each line from the stream is being written to the target, the next ones are read from the source. That is evident from the memory logs, as the highest memory consumption was less than 125MB and the Garbage Collector was doing its job in between. Although the method performed well on memory, it took around 42 seconds to finish processing the file.

Java Scanner

Java Scanner can be used to scan through a file, and it supports streaming the content without exhausting a large amount of memory.

Next is an example of using Java Scanner to copy a 10GB file.

private void copyUsingScanner() throws IOException {
  try (
      InputStream inputStream = new FileInputStream(source);
      Scanner scanner = new Scanner(inputStream, StandardCharsets.UTF_8);

      FileWriter fileWriter = new FileWriter(target, true);
      PrintWriter printWriter = new PrintWriter(new BufferedWriter(fileWriter));
  ) {
    while (scanner.hasNext()) {
      printWriter.println(scanner.next());
    }
  }
}

Output:

 Source File Size 10471
 Memory used: 9
 Memory used: 8
 Memory used: 9
 Memory used: 110
 Memory used: 27
 Memory used: 176
 Memory used: 44
 Memory used: 13
 Memory used: 74
 Memory used: 17
 Memory used: 184
 Memory used: 35
 total time 660054

Although the Scanner used almost the same amount of memory, its performance was extremely slow. It took around 11 minutes to copy the 10GB file from one location to another.

Using FileChannel

Next, we will cover an example of using Java FileChannels to transfer a large amount of data from one file to another.

private void copyUsingChannel() throws IOException {
  try (
      FileChannel inputChannel = new FileInputStream(source).getChannel();
      FileChannel outputChannel = new FileOutputStream(target).getChannel();
  ) {
    ByteBuffer buffer = ByteBuffer.allocateDirect(4 * 1024);
    while (inputChannel.read(buffer) != -1) {
      buffer.flip();               // switch the buffer from reading to writing mode
      outputChannel.write(buffer);
      buffer.clear();              // reset the buffer for the next read
    }
  }
}

Here, we use a direct buffer of 4 KB (4 * 1024 bytes).

 Source File Size 10471
 Memory used: 9
 Memory used: 10
 Memory used: 10
 Memory used: 10
 total time 21403

From the output, it is clear that this is, so far, the fastest and most memory-efficient way of processing large files.

Process Large Files in Chunks (BufferedInputStream)

Finally, we will look at the traditional Java IO way of processing large amounts of data. We will use a BufferedInputStream with the same buffer size we used for the FileChannel and analyze the results.

Next is an example of reading and writing a large file in chunks using Java BufferedInputStream.

private void copyUsingChunks() throws IOException {
  try (
      InputStream inputStream = new FileInputStream(source);
      BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);

      OutputStream outputStream = new FileOutputStream(target);
  ) {
    byte[] buffer = new byte[4 * 1024];
    int read;
    while ((read = bufferedInputStream.read(buffer, 0, buffer.length)) != -1) {
      outputStream.write(buffer, 0, read);
    }
  }
}

Output:

 Source File Size 10471
 Memory used: 9
 Memory used: 10
 Memory used: 10
 Memory used: 10
 total time 20581

The performance we see is similar to that of the FileChannel example. That is because we used a buffer of the same size.

Most Efficient Way of Java Large File Processing

We have tried various ways of reading and writing huge files in Java. In this section, we will discuss their performance and determine which one is the optimal way of handling large files in Java.

In Memory Transfer

As stated earlier, in-memory transfer is a fast way of moving data. However, holding the entire content of a file in memory, for example in a byte[] or a List<String>, is not practical with very large files. It can quickly exhaust all available memory when a file is very large or when the application serves multiple such requests simultaneously.

Java Stream and Scanner

In the Java Streams example of processing a large file, we generated a Stream of lines using BufferedReader, which produced a decent result. Similarly, the Java Scanner example also kept memory usage low. However, both of these transfers were slow.

FileChannel and Chunk Transfer using BufferedInputStream

We have also seen examples of using FileChannel and BufferedInputStream to read and write huge files. At the core of both examples is a fixed-size buffer, and both approaches demonstrated better speed along with low memory consumption.

Moreover, we can still improve the performance of these two approaches by using larger buffers, because larger buffers mean fewer interactions with the underlying files. However, larger buffers also mean higher memory consumption. To prove that, we will rerun both examples with a buffer size of 1048576 bytes (1 MB).

BufferedInputStream

We will modify the buffer size.

byte[] buffer = new byte[1048576];

And the output we get:

 Source File Size 10471
 Memory used: 9
 Memory used: 12
 Memory used: 12
 Memory used: 12
 total time 11390

FileChannel

Similarly, we will increase the ByteBuffer capacity in the FileChannel example.

ByteBuffer buffer = ByteBuffer.allocateDirect(1048576);

And the result looks like this:

 Source File Size 10471
 Memory used: 9
 Memory used: 10
 Memory used: 10
 Memory used: 10
 total time 11431

Both of the outputs above show a performance improvement with only a slightly higher impact on memory.

Conclusion

This detailed practical comparison concludes that using a buffer is the best way to transfer large amounts of data with Java IO. Copying the file in chunks helps to limit the amount of memory consumed by the file content.

Both FileChannel and BufferedInputStream performed almost identically in our tests. The advantage of using BufferedInputStream or FileChannel to read large files is that their buffer is configurable. Thus, based on the nature of the server load and the size of the files, we can tune the buffer size and eventually find the optimal and most efficient way to read large files with Java IO.
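To illustrate that tunability, here is a minimal sketch of the chunked copy with the buffer size exposed as a parameter. The class name, method name, paths, and default size are hypothetical, chosen for illustration rather than taken from the benchmark code above.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TunableCopy {

  // bufferSize is the tuning knob: larger values mean fewer interactions
  // with the underlying files, at the cost of more memory per transfer.
  static void copyInChunks(String source, String target, int bufferSize) throws IOException {
    try (
        InputStream inputStream = new BufferedInputStream(new FileInputStream(source), bufferSize);
        OutputStream outputStream = new FileOutputStream(target);
    ) {
      byte[] buffer = new byte[bufferSize];
      int read;
      while ((read = inputStream.read(buffer, 0, buffer.length)) != -1) {
        outputStream.write(buffer, 0, read);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // For example, a 1 MB buffer, matching the rerun above
    copyInChunks("/tmp/source.dat", "/tmp/target.dat", 1024 * 1024);
  }
}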

Summary

In this long, practice-oriented tutorial, we discussed processing large files in Java. We began by understanding that we can speed up large file reads at the cost of memory consumption, or keep memory utilization to a minimum at the cost of slower processing.

Also, we practically tested these approaches by using Java Streams, Java Scanner, Java FileChannel, and Java BufferedInputStream to transfer a 10GB file and analyzed their performance. Finally, we concluded that BufferedInputStream and FileChannel are the optimal and most efficient ways to read and write large files in Java IO, as they offer excellent control for optimizing large file handling. For more on Java, please visit Java Tutorials.