Saturday, January 14, 2017

How to find and remove duplicate files in directory using C# code?


It is common to have many copies of the same images and videos on your computer, quietly taking up disk space. I was inspired to write this article because I receive lots of repeated images, videos and documents on WhatsApp every day, and finding the duplicates in a directory by hand is very difficult. In this article, I’ll share C# code to find the duplicate files in a directory and remove them.

Let’s look at the folder below, which contains several duplicate images. I want to find all the duplicate images in this folder and delete them.


The code below finds and deletes the duplicate files in this folder.

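using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class Program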
{
    static void Main(string[] args)
    {
        ConsoleKeyInfo cki;
        double totalSize = 0;
        //pass directory path as argument to command line, otherwise fall back to a default folder
        string path = args.Length > 0 ? args[0] : @"D:\Temp\Images";

        //Get all files from given directory
        var fileLists = Directory.GetFiles(path);
        int totalFiles = fileLists.Length;

        List<FileDetails> finalDetails = new List<FileDetails>();
        List<string> ToDelete = new List<string>();
        //loop through all the files and compute a SHA-1 hash of each file's content
        foreach (var item in fileLists)
        {
            //dispose both the hash algorithm and the file stream when done
            using (var sha1 = SHA1.Create())
            using (var fs = new FileStream(item, FileMode.Open, FileAccess.Read))
            {
                finalDetails.Add(new FileDetails()
                {
                    FileName = item,
                    FileHash = BitConverter.ToString(sha1.ComputeHash(fs)),
                });
            }
        }
        //group by file hash code
        var similarList = finalDetails.GroupBy(f => f.FileHash)
            .Select(g => new { FileHash = g.Key, Files = g.Select(z => z.FileName).ToList() });


        //keeping first item of each group as is and identify rest as duplicate files to delete
        ToDelete.AddRange(similarList.SelectMany(f => f.Files.Skip(1)).ToList());
        Console.WriteLine("Total duplicate files - {0}", ToDelete.Count);
        //list all files to be deleted and count total disk space to be empty after delete
        if (ToDelete.Count > 0)
        {
            Console.WriteLine("Files to be deleted - ");
            foreach (var item in ToDelete)
            {
                Console.WriteLine(item);
                FileInfo fi = new FileInfo(item);
                totalSize += fi.Length;
            }
        }
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("Total space to be freed - {0} MB", Math.Round(totalSize / 1000000, 6));
        //restore the console's original colour instead of forcing white
        Console.ResetColor();
        //delete duplicate files
        if (ToDelete.Count > 0)
        {
            Console.WriteLine("Press C to continue with delete");
            Console.WriteLine("Press the Escape (Esc) key to quit: \n");
            do
            {
                cki = Console.ReadKey();
                Console.WriteLine(" --- You pressed {0}\n", cki.Key);
                if (cki.Key == ConsoleKey.C && ToDelete.Count > 0)
                {
                    Console.WriteLine("Deleting files...");
                    ToDelete.ForEach(File.Delete);
                    Console.WriteLine("Files are deleted successfully");
                    //clear the list so pressing C again does not repeat the delete
                    ToDelete.Clear();
                }
                Console.WriteLine("Press the Escape (Esc) key to quit: \n");
            } while (cki.Key != ConsoleKey.Escape);
        }
        else
        {
            Console.WriteLine("No files to delete");
            Console.ReadLine();
        }
    }
}
public class FileDetails
{
    public string FileName { get; set; }
    public string FileHash { get; set; }
}

Output -



As you can see, the duplicate files are deleted from the directory.



This program can be used to delete duplicates of any file type, such as images, videos and documents, because files are compared by the hash of their content rather than by their name.
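If you only want to see what would be removed before deleting anything, the grouping step can be pulled out into its own method. The following is a minimal sketch under that assumption, not part of the original program: the `DuplicateFinder` class and `FindDuplicateGroups` method are hypothetical names introduced here for illustration, and the sketch looks only at the top level of the directory, as the program above does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Hypothetical helper: returns one list per set of files whose content
    // produces the same SHA-1 hash. Files without a duplicate are left out.
    public static List<List<string>> FindDuplicateGroups(string directory)
    {
        var byHash = new Dictionary<string, List<string>>();
        using (var sha1 = SHA1.Create())
        {
            foreach (var file in Directory.GetFiles(directory))
            {
                string hash;
                using (var fs = File.OpenRead(file))
                {
                    hash = BitConverter.ToString(sha1.ComputeHash(fs));
                }

                List<string> group;
                if (!byHash.TryGetValue(hash, out group))
                {
                    group = new List<string>();
                    byHash[hash] = group;
                }
                group.Add(file);
            }
        }
        // Keep only the hashes that are shared by more than one file.
        return byHash.Values.Where(g => g.Count > 1).ToList();
    }

    public static void Main(string[] args)
    {
        // Dry run: list the duplicate groups without deleting anything.
        string path = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
        foreach (var group in FindDuplicateGroups(path))
        {
            Console.WriteLine("Same content:");
            foreach (var file in group)
                Console.WriteLine("  " + file);
        }
    }
}
```

Returning the groups instead of deleting straight away lets the caller decide which copy of each group to keep.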

I hope this code snippet helps you to find and remove duplicate files in your directory and make your life easy. Please leave your feedback in comments below.



Monday, December 26, 2016

How to handle exception in Parallel.For and Parallel.Foreach loop? – Task Parallel Library


In my previous article, I explained the Parallel.For and Parallel.ForEach loops in detail. In this article I will explain how to handle exceptions in Parallel.For and Parallel.ForEach loops.

The Parallel.For and Parallel.ForEach methods don’t provide any special mechanism to handle exceptions thrown from the loop body. We can catch exceptions with a try-catch block inside the parallel loop. Because the body runs on multiple threads, the same exception may be thrown by several threads in parallel. So we put a try-catch block inside the parallel loop to collect the exceptions, and another one around the call to the method that runs the loop, so that all of them can be handled in one place. See the example below.

Code –
namespace ParallelFor
{
using System;
using System.Threading.Tasks;
using System.Collections.Concurrent;
class Program
{
    static void Main(string[] args)
    {
        try
        {
            DoSomeWork();
        }
        catch (AggregateException ex)
        {
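            //loop through all the exceptions
            foreach (Exception e in ex.InnerExceptions)
            {
                if (e is ArgumentException)
                {
                    Console.ForegroundColor = ConsoleColor.Red;
                    Console.WriteLine(e.Message);
                    Console.ResetColor();
                }
                else
                    //rethrow anything unexpected without resetting its stack trace
                    System.Runtime.ExceptionServices.ExceptionDispatchInfo.Capture(e).Throw();
            }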
        }
        Console.ReadLine();
    }
    public static void DoSomeWork()
    {
        //ConcurrentQueue allows items to be queued safely from multiple threads.
        ConcurrentQueue<Exception> exceptionQueue = new ConcurrentQueue<Exception>();
               
        Parallel.For(0, 100, (i) =>
        {
            try
            {
                if (i == 20 || i == 30 || i == 40)
                    throw new ArgumentException();

                Console.Write("{0}, ", i);
            }
            catch (Exception e)
            {
                //Adding all the exceptions thrown from multiple thread in queue
                exceptionQueue.Enqueue(e);
            }
        });

        //throw all the exception as AggregateException after loop completes
        if (exceptionQueue.Count > 0) throw new AggregateException(exceptionQueue);
    }
}
}

Output –


As you can see in the above example, every exception thrown inside the Parallel.For loop (here an ArgumentException) is caught and added to a ConcurrentQueue from multiple threads. After the loop completes, the queued exceptions are thrown together as a single AggregateException, and the caller iterates over its InnerExceptions to read each exception message.
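For comparison, here is a minimal sketch of what happens when the loop body has no try-catch block of its own. In that case Parallel.For itself wraps whatever was thrown in an AggregateException and throws it to the caller once the running iterations have finished. The `UnhandledDemo` class and `RunAndCapture` method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class UnhandledDemo
{
    // Runs a loop whose body throws for a few indexes and returns the
    // exception that Parallel.For surfaces to the caller.
    public static AggregateException RunAndCapture()
    {
        try
        {
            Parallel.For(0, 100, i =>
            {
                // No try-catch here: the exception escapes the loop body.
                if (i == 20 || i == 30 || i == 40)
                    throw new ArgumentException("Bad index " + i);
            });
            return null; // not reached, because index 20 always throws if it runs
        }
        catch (AggregateException ex)
        {
            return ex;
        }
    }

    public static void Main()
    {
        AggregateException ex = RunAndCapture();
        // How many inner exceptions arrive can vary from run to run.
        foreach (Exception inner in ex.InnerExceptions)
            Console.WriteLine("{0}: {1}", inner.GetType().Name, inner.Message);
    }
}
```

The difference from the queue approach is that an unhandled exception makes the loop stop starting new iterations, so some of the 100 iterations may never run. Catching inside the body and queuing the exception lets every iteration run before the errors are reported.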


Monday, November 28, 2016

Parallel.For and Parallel.Foreach – Task Parallel Library


The Task Parallel Library (TPL) provides the Parallel.For and Parallel.ForEach methods to achieve data parallelism. Data parallelism refers to scenarios where the same operation is performed concurrently on the elements of a collection or range. Parallel.For and Parallel.ForEach are static methods of the System.Threading.Tasks.Parallel class.

Sequential loop like below -

for(int i=0; i<10; i++)
     Console.Write("{0},",i);

Output -
0,1,2,3,4,5,6,7,8,9,


Parallel for loop –

Parallel.For(0, 10, i => Console.Write("{0},",i));

Output –
2,3,4,8,9,1,7,6,5,0,


Parallel foreach loop - 

List<int> numbers = new List<int>() {0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
Parallel.ForEach(numbers, x => Console.Write("{0},", x));

Output –
2,3,5,7,9,1,8,4,6,0,

We all know how a sequential loop works. When a parallel loop executes, the Task Parallel Library partitions the data into multiple parts so that the loop can work on those parts concurrently. Internally, the TPL task scheduler distributes the parts between multiple threads if required. This is why the output of the parallel for loop differs from that of the sequential for loop. With a parallel loop the order of processing is not guaranteed, and it can change from one run to the next.
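A small sketch can make the ordering point concrete: the parallel loop visits every value exactly once, only the order in which the values arrive changes. The `OrderDemo` class and `CollectInParallel` method are hypothetical names introduced for this illustration, and a ConcurrentQueue is used because a plain List is not safe to fill from several threads at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class OrderDemo
{
    // Records each index in the order in which the iterations happen to run.
    public static int[] CollectInParallel(int count)
    {
        var seen = new ConcurrentQueue<int>();
        Parallel.For(0, count, i => seen.Enqueue(i));
        return seen.ToArray();
    }

    public static void Main()
    {
        int[] arrival = CollectInParallel(10);

        // The arrival order may differ between runs.
        Console.WriteLine("Arrival order: " +
            string.Join(",", arrival.Select(n => n.ToString()).ToArray()));

        // Sorting shows that no value was skipped or repeated.
        Console.WriteLine("Sorted:        " +
            string.Join(",", arrival.OrderBy(n => n).Select(n => n.ToString()).ToArray()));
    }
}
```

If the results must come out in a particular order, sort them after the loop, as the last line does, rather than relying on the order in which the iterations ran.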

Parallel For and ForEach loops can improve performance by using the multiple cores of your machine, so developers can use them to make fuller use of the available hardware. The Task Parallel Library adapts to the number of cores available without any change to your code. If you run a parallel loop on a single-core machine, it uses that single core and takes a bit more time than the same loop on a four-core machine. The image below shows a parallel loop using all four cores of a machine to complete its work.





How to break parallel loops in between?


We normally break out of sequential loops using the break keyword, like below.

for (int i = 0; i < 500; i++)
{
    if (i == 100)
        break;
    Console.Write("{0},", i);
}

But you can’t use break in parallel loops, because the loop body is a delegate rather than a language-level loop. To break out of a parallel loop on some condition, you need to use one of the Parallel.For or Parallel.ForEach overloads whose delegate accepts a ParallelLoopState, like below.

public static ParallelLoopResult For(int fromInclusive, int toExclusive, Action<int, ParallelLoopState> body);

See below example to break parallel loops -

Parallel.For(0, 10, (int i, ParallelLoopState loopstate) =>
{
    if (i == 5)
    {
        loopstate.Break();
        return;
    }

    Console.Write("{0},", i);
});

Output –
8,4,2,6,3,0,1,

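As you can see in the output, the loop breaks when the value of ‘i’ is 5. Break() does not stop the loop immediately: every iteration with an index lower than 5 still runs, while iterations with a higher index (6 and 8 here) may already have run on other threads before the break was requested. The order of the output is not guaranteed either, because the Task Parallel Library divides the data into multiple parts and processes them concurrently.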
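The behaviour of Break() can be checked with a short sketch that records which iterations actually ran. The `BreakDemo` class and `RunWithBreak` method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class BreakDemo
{
    // Runs a parallel loop that calls Break() at the given index and
    // returns, in sorted order, the indexes whose work was actually done.
    public static int[] RunWithBreak(int count, int breakAt)
    {
        var executed = new ConcurrentQueue<int>();
        Parallel.For(0, count, (int i, ParallelLoopState loopstate) =>
        {
            if (i == breakAt)
            {
                loopstate.Break();
                return;
            }
            executed.Enqueue(i);
        });
        return executed.OrderBy(n => n).ToArray();
    }

    public static void Main()
    {
        int[] ran = RunWithBreak(10, 5);
        // Indexes 0 to 4 are always present; any of 6 to 9 may also appear.
        Console.WriteLine(string.Join(",", ran.Select(n => n.ToString()).ToArray()));
    }
}
```

Indexes 0 to 4 appear in the result on every run, index 5 never does because it returns before doing any work, and which of the higher indexes appear depends on how the iterations were scheduled.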

Breaking out of a loop means that the loop did not run to completion. You can check whether the loop completed using ParallelLoopResult, which is the return type of the Parallel.For and Parallel.ForEach overloads. Its LowestBreakIteration property gives the index of the lowest iteration from which Break() was called. The example below checks whether the loop completed or was broken out of.

ParallelLoopResult result = Parallel.For(0, 10,(int i, ParallelLoopState loopstate) =>
{
    if (i == 5)
    {
        loopstate.Break();
        return;
    }
    Console.Write(string.Format("{0},", i));
});
if (result.IsCompleted)
    Console.WriteLine("Loop fully completed.");
else if (result.LowestBreakIteration.HasValue)
    Console.WriteLine("Loop not fully completed. Broken at value {0}", result.LowestBreakIteration.Value);

Output –
2,3,4,6,1,8,0,
Loop not fully completed. Broken at value 5


I hope this article helps you learn more about the Parallel.For and Parallel.ForEach methods of the Task Parallel Library. Please leave your feedback in the comments below.

